| Characteristics | Filter | Wrapper | Embedded |
|---|---|---|---|
| Avoids Overfitting | High | Low | Moderate |
| Computational Efficiency | High | Low | Moderate |
| Performance Accuracy | Low | High | Moderate |
| Ability to Handle High-Dimensional Data | High | Low | Moderate |
| Complexity | Low | High | Moderate |
| Model Dependency | No | Yes | Yes |
| Takes Into Account Feature Interactions | No | Yes | Yes |
| Prone to Select Redundant Features | High | Low | Moderate |
| Suitable for Smaller Datasets | Yes | No | Yes |
| Interpretability | High | Moderate | Low |
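The filter column of the table can be illustrated in a few lines of base R: a model-free ranking of features by their absolute correlation with a numeric outcome. This is a minimal sketch on simulated data; the variable names (x1, x2, x3) are hypothetical and not from the PPMI dataset.

```r
# Filter-method sketch: rank features by |correlation| with the outcome,
# with no model fitted at any point (hence fast and model-independent).
set.seed(42)
n  <- 200
x1 <- rnorm(n)              # informative feature
x2 <- rnorm(n)              # pure noise
x3 <- rnorm(n)              # pure noise
y_num <- 2 * x1 + rnorm(n)  # outcome driven mainly by x1

scores <- sapply(data.frame(x1, x2, x3),
                 function(col) abs(cor(col, y_num)))
top_feature <- names(sort(scores, decreasing = TRUE))[1]
print(top_feature)
```

Because the ranking never consults a model, it is cheap and resists overfitting, but it also ignores feature interactions — exactly the trade-offs summarized in the table.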
# Load the data
data <-
read.csv("I:/UBD PB/ZH5103 Predictive/Feature Selection/06_PPMI_ClassificationValidationData_Short.csv", stringsAsFactors = FALSE)
# Select only PD and Control cases and remove irrelevant columns
data_filtered <- data %>%
filter(ResearchGroup %in% c('PD', 'Control')) %>%
select(-c(X, FID_IID, VisitID))
# Convert appropriate columns to factors or numerical values as necessary
data_filtered$ResearchGroup <- as.factor(data_filtered$ResearchGroup)
# Set a seed for reproducibility
set.seed(123)
# Prepare the predictor and response variables
x <- data_filtered[, -which(names(data_filtered) == "ResearchGroup")]
y <- data_filtered$ResearchGroup
In this approach, the dataset is split into two parts: x and y. The x variable is a data frame containing the independent variables, while y is a vector or factor representing the dependent variable. This method is particularly useful when x and y have been preprocessed or filtered in advance.
# Apply Boruta for feature selection
boruta_result1 <- Boruta(as.data.frame(x), y, doTrace = 2, maxRuns = 100, pValue = 0.05)
boruta_result2 <- Boruta(as.data.frame(x), y, doTrace = 2, maxRuns = 50, pValue = 0.05)
boruta_result3 <- Boruta(as.data.frame(x), y, doTrace = 2, maxRuns = 100, pValue = 0.01)
boruta_result4 <- Boruta(as.data.frame(x), y, doTrace = 2, maxRuns = 50, pValue = 0.01)
# Print the results
print(boruta_result1)
print(boruta_result2)
print(boruta_result3)
print(boruta_result4)
## Boruta performed 99 iterations in 12.18746 secs.
## 24 attributes confirmed important: Age, chr17_rs199533_DP,
## UPDRS_Part_I_Summary_Score_Baseline,
## UPDRS_Part_I_Summary_Score_Month_03,
## UPDRS_Part_I_Summary_Score_Month_06 and 19 more;
## 73 attributes confirmed unimportant: chr12_rs34637584_GT,
## chr17_rs11012_GT, chr17_rs11868035_GT, chr17_rs12185268_GT,
## chr17_rs199533_GQ and 68 more;
## 3 tentative attributes left:
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline,
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_24,
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_24;
## Boruta performed 49 iterations in 5.475048 secs.
## 23 attributes confirmed important: Age,
## UPDRS_Part_I_Summary_Score_Baseline,
## UPDRS_Part_I_Summary_Score_Month_03,
## UPDRS_Part_I_Summary_Score_Month_06,
## UPDRS_Part_I_Summary_Score_Month_09 and 18 more;
## 71 attributes confirmed unimportant: chr12_rs34637584_GT,
## chr17_rs11012_GT, chr17_rs11868035_GT, chr17_rs12185268_GT,
## chr17_rs199533_GT and 66 more;
## 6 tentative attributes left: chr17_rs199533_DP, chr17_rs199533_GQ,
## L_putamen_ComputeArea,
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_06,
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_24
## and 1 more;
## Boruta performed 99 iterations in 10.91136 secs.
## 24 attributes confirmed important: Age, chr17_rs199533_DP,
## UPDRS_Part_I_Summary_Score_Baseline,
## UPDRS_Part_I_Summary_Score_Month_03,
## UPDRS_Part_I_Summary_Score_Month_06 and 19 more;
## 74 attributes confirmed unimportant: chr12_rs34637584_GT,
## chr17_rs11012_GT, chr17_rs11868035_GT, chr17_rs12185268_GT,
## chr17_rs199533_GQ and 69 more;
## 2 tentative attributes left:
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_24,
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_24;
## Boruta performed 49 iterations in 6.599455 secs.
## 22 attributes confirmed important: Age,
## UPDRS_Part_I_Summary_Score_Month_03,
## UPDRS_Part_I_Summary_Score_Month_06,
## UPDRS_Part_I_Summary_Score_Month_09,
## UPDRS_Part_I_Summary_Score_Month_12 and 17 more;
## 72 attributes confirmed unimportant: chr12_rs34637584_GT,
## chr17_rs11012_GT, chr17_rs11868035_GT, chr17_rs12185268_GT,
## chr17_rs199533_GQ and 67 more;
## 6 tentative attributes left: chr17_rs199533_DP,
## L_insular_cortex_ComputeArea, L_putamen_Volume,
## UPDRS_Part_I_Summary_Score_Baseline,
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_24
## and 1 more;
The doTrace parameter in Boruta controls the verbosity of the algorithm’s output. Setting it to 2 provides detailed output, which is beneficial for debugging, monitoring progress, and understanding the algorithm’s behavior.
Number of Runs (maxRuns): Runs with maxRuns = 100 (boruta_result1 and boruta_result3) left fewer tentative features than runs with maxRuns = 50. More iterations appear to provide a more stable feature selection.
Significance Level (pValue): A stricter p-value (0.01) led to a more conservative selection of important features (boruta_result3 and boruta_result4). The p-value setting influences how easily a feature is deemed important.
Important Features: While there is a core set of important features that appears in all runs, some features shift categories depending on the algorithm’s parameters.
Tentative Features: Fewer tentative features were observed with more iterations and a stricter (lower) p-value. These settings make the algorithm more decisive.
Unimportant Features: The count of unimportant features remained fairly consistent across different runs, suggesting a stable set of unimportant features.
Summary: Parameter tuning in Boruta significantly impacts feature selection. The number of iterations and the p-value setting can influence the stability and conservatism of the selected features, highlighting the importance of careful parameter selection.
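The parameter effects above rest on Boruta’s core mechanic: every real feature competes against “shadow” features, i.e. permuted copies of the original columns. A compressed base-R sketch of that idea follows, using absolute correlation as a stand-in for the random-forest importance Boruta actually uses; all data are simulated and the names are hypothetical.

```r
# Boruta-style shadow comparison, in miniature:
# a feature is a candidate for "Confirmed" only if its importance
# beats the best importance achieved by any permuted (shadow) copy.
set.seed(123)
n <- 300
real  <- data.frame(signal = rnorm(n), noise = rnorm(n))
y_num <- real$signal + rnorm(n, sd = 0.5)

shadows <- as.data.frame(lapply(real, sample))  # permuted copies

imp_real   <- sapply(real,    function(col) abs(cor(col, y_num)))
imp_shadow <- sapply(shadows, function(col) abs(cor(col, y_num)))

confirmed <- names(real)[imp_real > max(imp_shadow)]
print(confirmed)
```

Boruta repeats this comparison over many iterations (maxRuns) and applies a statistical test at the chosen pValue, which is why both parameters shift features between the Confirmed, Tentative, and Rejected categories.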
## L_insular_cortex_AvgMeanCurvature L_insular_cortex_ComputeArea
## [1,] 1.4575722 0.3168555
## [2,] 2.0179328 2.5020619
## [3,] 0.1184280 1.4886066
## [4,] -0.7848265 -0.2750471
## [5,] 2.0766387 1.4489491
## [6,] 2.4056298 0.8810963
## L_insular_cortex_Volume L_insular_cortex_ShapeIndex
## [1,] 2.4018065 1.6499661
## [2,] 0.1333169 2.3795178
## [3,] 1.1098431 1.0020723
## [4,] 0.6386626 -1.1366438
## [5,] 1.6519704 0.2740456
## [6,] -1.6893010 0.1603955
## L_insular_cortex_Curvedness R_insular_cortex_AvgMeanCurvature
## [1,] 1.8500368 1.6249561
## [2,] -0.4179922 1.6114488
## [3,] 3.0405062 2.0846537
## [4,] 2.1538222 0.2770928
## [5,] 1.3090330 1.5141512
## [6,] 0.3656473 -0.4236452
## R_insular_cortex_ComputeArea R_insular_cortex_Volume
## [1,] 0.07125802 2.1872772
## [2,] 1.65758737 2.4094544
## [3,] 0.61115320 1.9471140
## [4,] 1.97862467 -0.6040013
## [5,] 2.23920608 1.3539884
## [6,] 1.41345687 2.0290873
## R_insular_cortex_ShapeIndex R_insular_cortex_Curvedness
## [1,] 0.5482724 1.4390631
## [2,] 1.0248102 1.6983896
## [3,] 1.5654154 0.2684312
## [4,] 2.4536535 0.8039166
## [5,] 0.1376668 1.1090759
## [6,] 0.5859904 1.7066907
## L_cingulate_gyrus_AvgMeanCurvature L_cingulate_gyrus_ComputeArea
## [1,] 2.08498585 0.2139784
## [2,] 2.08738020 1.2044086
## [3,] 0.24167189 0.7094790
## [4,] 0.08894282 -0.2040811
## [5,] -0.12749274 -0.2194260
## [6,] 1.61085884 0.7939635
## L_cingulate_gyrus_Volume L_cingulate_gyrus_ShapeIndex
## [1,] 2.4527342 0.8047009
## [2,] 1.6463114 0.1504341
## [3,] 1.6602767 0.8699307
## [4,] 1.4653188 1.9493496
## [5,] 1.7149580 -0.1985188
## [6,] 0.4547431 0.3558922
## L_cingulate_gyrus_Curvedness R_cingulate_gyrus_AvgMeanCurvature
## [1,] 0.5158144 1.1799982
## [2,] 0.6916040 1.3841092
## [3,] 0.8243083 1.4402425
## [4,] 1.8005817 0.1481411
## [5,] 0.4060693 2.0008725
## [6,] -1.0842008 2.0072019
## R_cingulate_gyrus_ComputeArea R_cingulate_gyrus_Volume
## [1,] 1.055216 1.67531038
## [2,] 1.161354 -0.01808199
## [3,] 1.335174 0.12925296
## [4,] 1.981630 0.58196925
## [5,] 1.236868 0.93874859
## [6,] 1.434090 -0.03696876
## R_cingulate_gyrus_ShapeIndex R_cingulate_gyrus_Curvedness
## [1,] 1.5656802 1.4751285
## [2,] 1.0879341 2.1581460
## [3,] 0.3246879 0.7685935
## [4,] -1.2551747 -1.2448449
## [5,] -1.1613158 2.2585246
## [6,] 1.1108282 0.4668317
## L_caudate_AvgMeanCurvature L_caudate_ComputeArea L_caudate_Volume
## [1,] 0.3689286 2.623999 1.8575358
## [2,] 0.7045198 1.215581 1.8352289
## [3,] 1.0477284 2.272598 -1.3157016
## [4,] 1.4007975 0.717974 0.5544148
## [5,] 1.4199532 1.291681 1.1975330
## [6,] 1.3053149 2.340253 1.2104920
## L_caudate_ShapeIndex L_caudate_Curvedness R_caudate_AvgMeanCurvature
## [1,] 0.7893853 0.6320394 0.3549448
## [2,] -0.1658701 1.5923169 -0.1224016
## [3,] 1.8184932 0.9017798 0.1163604
## [4,] 1.4788492 0.1042350 1.1324841
## [5,] 0.6108149 0.7955619 0.5988750
## [6,] 1.8985765 1.6016947 -0.7307965
## R_caudate_ComputeArea R_caudate_Volume R_caudate_ShapeIndex
## [1,] 1.5021563 1.7947322 -1.0009349
## [2,] 1.8073001 1.4099976 0.9488633
## [3,] 1.5559798 1.9075759 3.1657444
## [4,] 0.1000046 0.2276313 1.7399885
## [5,] 1.1630013 0.4793466 -0.6413710
## [6,] 0.4797352 1.6881938 -1.2991073
## R_caudate_Curvedness L_putamen_AvgMeanCurvature L_putamen_ComputeArea
## [1,] 2.4573585 2.1035269 0.8134767
## [2,] 0.7753894 0.8868427 2.7835019
## [3,] 1.4681141 1.4381819 0.8212851
## [4,] -0.3063782 -0.4042961 0.5646813
## [5,] 2.5236557 2.0152844 1.5797997
## [6,] 1.1690431 0.7285983 1.8369029
## L_putamen_Volume L_putamen_ShapeIndex L_putamen_Curvedness
## [1,] 0.3278887 0.4631668 1.1401877
## [2,] 0.9470925 0.5139027 0.2051307
## [3,] 3.2912181 1.2372445 -0.4375165
## [4,] 0.4905340 -0.7217263 2.1662945
## [5,] 1.6526505 2.1943070 2.3038511
## [6,] 1.7938517 0.5636302 1.5560349
## R_putamen_AvgMeanCurvature R_putamen_ComputeArea R_putamen_Volume
## [1,] 2.137838 0.8863705 1.2519663
## [2,] 1.481967 -0.8380033 0.3910383
## [3,] 1.674552 -0.4066102 2.2495319
## [4,] 2.146203 0.9084034 2.3003726
## [5,] 2.206965 -0.2368738 0.9881542
## [6,] 1.000981 1.4926063 1.7780447
## R_putamen_ShapeIndex R_putamen_Curvedness L_hippocampus_AvgMeanCurvature
## [1,] -1.70660470 1.3380997 1.65285114
## [2,] -1.28337806 0.6696556 -0.86254872
## [3,] -0.02917993 1.9272518 0.03356209
## [4,] 0.55901892 -0.6449057 0.81830839
## [5,] 1.16657167 0.1263396 0.74227838
## [6,] 1.44749417 1.1763315 1.90216463
## L_hippocampus_ComputeArea L_hippocampus_Volume L_hippocampus_ShapeIndex
## [1,] 0.8479429 0.5432731 -1.2815905
## [2,] 0.2116243 -0.3646632 1.6559179
## [3,] 1.1740793 1.3666260 0.8870863
## [4,] 0.4314535 0.7420072 1.7380902
## [5,] 0.8862059 -0.1380858 -1.6387995
## [6,] 1.6253748 1.0810178 1.3717131
## L_hippocampus_Curvedness R_hippocampus_AvgMeanCurvature
## [1,] 0.01034205 1.4644988
## [2,] 0.13525363 -0.3566982
## [3,] -0.78574302 1.5668742
## [4,] 0.99776894 1.3360596
## [5,] 1.61132280 1.4418039
## [6,] -0.21483956 0.1805512
## R_hippocampus_ComputeArea R_hippocampus_Volume R_hippocampus_ShapeIndex
## [1,] 1.0938517 1.0984651 -1.6948827
## [2,] 1.1139730 -0.9200334 1.6390966
## [3,] 0.4097912 0.7119585 1.4684664
## [4,] 0.9873384 0.2444641 -0.4524415
## [5,] 0.3861019 1.1213706 1.3789829
## [6,] 2.2379298 1.1829867 0.6276756
## R_hippocampus_Curvedness L_fusiform_gyrus_AvgMeanCurvature
## [1,] 2.1922616 1.336187
## [2,] -0.2668791 1.497511
## [3,] 1.6390541 1.583220
## [4,] -0.3545161 1.263418
## [5,] 1.5652478 1.405847
## [6,] 1.4483447 2.097670
## L_fusiform_gyrus_ComputeArea L_fusiform_gyrus_Volume
## [1,] 0.7581712 0.5597572
## [2,] -0.1391539 1.0878979
## [3,] 2.2270189 2.7015835
## [4,] 0.2263073 0.3822766
## [5,] 1.4221860 -0.2225678
## [6,] 0.5526149 -0.2029002
## L_fusiform_gyrus_ShapeIndex L_fusiform_gyrus_Curvedness
## [1,] -0.1453741874 0.3812049
## [2,] 0.0006061347 2.5977577
## [3,] -1.7311577903 1.2834805
## [4,] 0.9806522959 1.2491067
## [5,] 1.9847536790 0.0574726
## [6,] 1.5348939217 0.1554053
## R_fusiform_gyrus_AvgMeanCurvature R_fusiform_gyrus_ComputeArea
## [1,] -0.6306410 0.6832405
## [2,] 0.6487849 1.5985719
## [3,] 0.1574757 2.6263074
## [4,] -0.9139837 1.0468473
## [5,] 1.2669879 0.4960261
## [6,] 0.9020351 1.4863491
## R_fusiform_gyrus_Volume R_fusiform_gyrus_ShapeIndex
## [1,] 0.97447236 0.6356051
## [2,] 1.65637557 0.3785380
## [3,] 2.15827940 2.5009249
## [4,] -0.01667523 0.6659844
## [5,] 1.75851739 0.1245194
## [6,] -0.54517791 2.3454982
## R_fusiform_gyrus_Curvedness Sex Weight Age
## [1,] 2.105472 0.6270524 0.3666273 3.549330
## [2,] -1.033233 1.0010015 1.0123159 2.371825
## [3,] 1.256086 -1.0010015 -0.2173583 4.100561
## [4,] 1.253101 0.7011082 1.4890379 2.334168
## [5,] 1.478845 1.5915984 1.7489921 2.537042
## [6,] 1.137965 -0.5585948 -0.5470142 2.879319
## chr12_rs34637584_GT chr17_rs11868035_GT chr17_rs11012_GT chr17_rs393152_GT
## [1,] 0 0.6965356 1.0010015 2.34587103
## [2,] 0 1.3343508 0.0420519 0.40306416
## [3,] 0 0.4181738 0.5135048 1.07185191
## [4,] 0 0.4275806 0.5585466 0.55089982
## [5,] 0 0.1938591 0.4329720 0.05920518
## [6,] 0 1.4000793 1.0010015 -0.18396351
## chr17_rs12185268_GT chr17_rs199533_GT chr17_rs199533_DP chr17_rs199533_GQ
## [1,] 0.1367317 1.0432697 3.125936 2.1181380
## [2,] -1.3621627 0.8236314 2.440333 1.0764309
## [3,] -1.1315971 0.6179382 3.558067 0.5101253
## [4,] 0.7757047 0.5893902 3.576184 2.2253168
## [5,] 2.0063510 1.0687291 3.390226 2.4074257
## [6,] 0.3036136 1.0010015 2.745027 3.0115024
## UPDRS_Part_I_Summary_Score_Baseline UPDRS_Part_I_Summary_Score_Month_03
## [1,] 2.800155 5.844546
## [2,] 3.904411 6.565039
## [3,] 2.360549 6.286197
## [4,] 3.325341 5.406509
## [5,] 3.472736 5.990894
## [6,] 2.484510 5.546972
## UPDRS_Part_I_Summary_Score_Month_06 UPDRS_Part_I_Summary_Score_Month_09
## [1,] 3.542529 5.703211
## [2,] 3.888444 4.995424
## [3,] 2.999376 5.163940
## [4,] 4.838372 5.661789
## [5,] 4.635391 6.142473
## [6,] 4.143960 5.262097
## UPDRS_Part_I_Summary_Score_Month_12 UPDRS_Part_I_Summary_Score_Month_18
## [1,] 4.853104 4.864344
## [2,] 5.437347 5.519004
## [3,] 6.360658 5.077066
## [4,] 5.663594 5.250520
## [5,] 5.476374 5.454942
## [6,] 5.233512 5.051124
## UPDRS_Part_I_Summary_Score_Month_24
## [1,] 4.417988
## [2,] 3.935287
## [3,] 3.750182
## [4,] 3.749612
## [5,] 3.883048
## [6,] 3.947062
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline
## [1,] 13.65763
## [2,] 14.33658
## [3,] 13.78602
## [4,] 15.52539
## [5,] 13.91585
## [6,] 14.87217
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_03
## [1,] 7.404620
## [2,] 8.434361
## [3,] 8.324734
## [4,] 7.931181
## [5,] 8.023673
## [6,] 7.631500
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_06
## [1,] 5.815875
## [2,] 5.940544
## [3,] 4.754393
## [4,] 6.257243
## [5,] 5.719864
## [6,] 5.868327
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_09
## [1,] 6.996387
## [2,] 6.649270
## [3,] 5.786236
## [4,] 6.931409
## [5,] 5.909766
## [6,] 6.590126
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12
## [1,] 14.34946
## [2,] 14.54940
## [3,] 13.46194
## [4,] 13.28313
## [5,] 15.41348
## [6,] 14.70630
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_18
## [1,] 6.455189
## [2,] 6.072463
## [3,] 6.375701
## [4,] 6.247198
## [5,] 6.112595
## [6,] 6.845006
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24
## [1,] 12.13650
## [2,] 13.37313
## [3,] 11.48894
## [4,] 11.82539
## [5,] 10.60863
## [6,] 11.62926
## UPDRS_Part_III_Summary_Score_Baseline
## [1,] 20.61729
## [2,] 19.67815
## [3,] 19.69573
## [4,] 18.32255
## [5,] 20.17695
## [6,] 19.86534
## UPDRS_Part_III_Summary_Score_Month_03
## [1,] 7.186981
## [2,] 8.042307
## [3,] 7.551345
## [4,] 7.347274
## [5,] 7.799312
## [6,] 8.194303
## UPDRS_Part_III_Summary_Score_Month_06
## [1,] 7.145747
## [2,] 7.740026
## [3,] 6.940100
## [4,] 6.890313
## [5,] 7.751139
## [6,] 7.986712
## UPDRS_Part_III_Summary_Score_Month_09
## [1,] 7.750192
## [2,] 7.557266
## [3,] 7.355727
## [4,] 7.998060
## [5,] 8.203538
## [6,] 8.149958
## UPDRS_Part_III_Summary_Score_Month_12
## [1,] 16.94946
## [2,] 17.43622
## [3,] 16.85379
## [4,] 19.01734
## [5,] 16.05588
## [6,] 16.55986
## UPDRS_Part_III_Summary_Score_Month_18
## [1,] 7.621021
## [2,] 7.625016
## [3,] 8.598616
## [4,] 7.954651
## [5,] 7.591975
## [6,] 6.585397
## UPDRS_Part_III_Summary_Score_Month_24
## [1,] 17.53584
## [2,] 17.26039
## [3,] 15.92304
## [4,] 16.89378
## [5,] 17.43966
## [6,] 17.49056
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline
## [1,] -0.2304422
## [2,] 1.6950085
## [3,] 1.0654703
## [4,] 1.0260148
## [5,] 2.2002883
## [6,] 1.7386536
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_06
## [1,] 2.486879
## [2,] 1.212668
## [3,] 0.738164
## [4,] -0.825885
## [5,] 1.086697
## [6,] 2.162564
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_12
## [1,] 2.3456883
## [2,] 1.7983469
## [3,] 0.8308222
## [4,] 0.9820227
## [5,] -0.3432599
## [6,] 2.6871386
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_24
## [1,] 3.2483354
## [2,] 1.1968567
## [3,] 0.5605115
## [4,] 2.3996317
## [5,] 2.2127860
## [6,] 2.0537891
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Baseline
## [1,] -0.6394459
## [2,] 2.6852313
## [3,] -2.0702201
## [4,] 1.5567095
## [5,] 1.6286756
## [6,] 2.2628767
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_06
## [1,] 4.625964
## [2,] 4.791965
## [3,] 5.951229
## [4,] 5.311131
## [5,] 5.503517
## [6,] 4.888696
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_12
## [1,] 1.7104838
## [2,] 1.8925122
## [3,] 0.9781759
## [4,] 1.0839147
## [5,] -0.8450536
## [6,] 0.2634916
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_24
## [1,] 0.9649447
## [2,] 1.6866871
## [3,] 3.7405175
## [4,] 2.8436060
## [5,] 2.5279789
## [6,] 2.3242818
## shadowMax shadowMean shadowMin
## [1,] 2.395262 -0.148371906 -2.140496
## [2,] 1.764680 -0.076774324 -2.436643
## [3,] 3.184932 -0.062144476 -2.535188
## [4,] 2.257912 -0.056584422 -2.345969
## [5,] 2.148210 -0.097621473 -2.296801
## [6,] 2.000556 -0.001422071 -2.329772
df_long <- tidyr::gather(as.data.frame(boruta_result3$ImpHistory), feature, measurement)
plot_ly(df_long, x = ~feature, y = ~measurement, color = ~feature, type = "box") %>%
  layout(title="Box-and-whisker Plots",
         xaxis = list(title="Features", categoryorder = "total descending"),
         yaxis = list(title="Importance"), showlegend = FALSE)
# Resolve the remaining tentative features and print the final decisions
final_boruta <- TentativeRoughFix(boruta_result3)
print(final_boruta)
## Boruta performed 99 iterations in 10.91136 secs.
## Tentatives roughfixed over the last 99 iterations.
## 24 attributes confirmed important: Age, chr17_rs199533_DP,
## UPDRS_Part_I_Summary_Score_Baseline,
## UPDRS_Part_I_Summary_Score_Month_03,
## UPDRS_Part_I_Summary_Score_Month_06 and 19 more;
## 76 attributes confirmed unimportant: chr12_rs34637584_GT,
## chr17_rs11012_GT, chr17_rs11868035_GT, chr17_rs12185268_GT,
## chr17_rs199533_GQ and 71 more;
## L_insular_cortex_AvgMeanCurvature
## Rejected
## L_insular_cortex_ComputeArea
## Rejected
## L_insular_cortex_Volume
## Rejected
## L_insular_cortex_ShapeIndex
## Rejected
## L_insular_cortex_Curvedness
## Rejected
## R_insular_cortex_AvgMeanCurvature
## Rejected
## R_insular_cortex_ComputeArea
## Rejected
## R_insular_cortex_Volume
## Rejected
## R_insular_cortex_ShapeIndex
## Rejected
## R_insular_cortex_Curvedness
## Rejected
## L_cingulate_gyrus_AvgMeanCurvature
## Rejected
## L_cingulate_gyrus_ComputeArea
## Rejected
## L_cingulate_gyrus_Volume
## Rejected
## L_cingulate_gyrus_ShapeIndex
## Rejected
## L_cingulate_gyrus_Curvedness
## Rejected
## R_cingulate_gyrus_AvgMeanCurvature
## Rejected
## R_cingulate_gyrus_ComputeArea
## Rejected
## R_cingulate_gyrus_Volume
## Rejected
## R_cingulate_gyrus_ShapeIndex
## Rejected
## R_cingulate_gyrus_Curvedness
## Rejected
## L_caudate_AvgMeanCurvature
## Rejected
## L_caudate_ComputeArea
## Rejected
## L_caudate_Volume
## Rejected
## L_caudate_ShapeIndex
## Rejected
## L_caudate_Curvedness
## Rejected
## R_caudate_AvgMeanCurvature
## Rejected
## R_caudate_ComputeArea
## Rejected
## R_caudate_Volume
## Rejected
## R_caudate_ShapeIndex
## Rejected
## R_caudate_Curvedness
## Rejected
## L_putamen_AvgMeanCurvature
## Rejected
## L_putamen_ComputeArea
## Rejected
## L_putamen_Volume
## Rejected
## L_putamen_ShapeIndex
## Rejected
## L_putamen_Curvedness
## Rejected
## R_putamen_AvgMeanCurvature
## Rejected
## R_putamen_ComputeArea
## Rejected
## R_putamen_Volume
## Rejected
## R_putamen_ShapeIndex
## Rejected
## R_putamen_Curvedness
## Rejected
## L_hippocampus_AvgMeanCurvature
## Rejected
## L_hippocampus_ComputeArea
## Rejected
## L_hippocampus_Volume
## Rejected
## L_hippocampus_ShapeIndex
## Rejected
## L_hippocampus_Curvedness
## Rejected
## R_hippocampus_AvgMeanCurvature
## Rejected
## R_hippocampus_ComputeArea
## Rejected
## R_hippocampus_Volume
## Rejected
## R_hippocampus_ShapeIndex
## Rejected
## R_hippocampus_Curvedness
## Rejected
## L_fusiform_gyrus_AvgMeanCurvature
## Rejected
## L_fusiform_gyrus_ComputeArea
## Rejected
## L_fusiform_gyrus_Volume
## Rejected
## L_fusiform_gyrus_ShapeIndex
## Rejected
## L_fusiform_gyrus_Curvedness
## Rejected
## R_fusiform_gyrus_AvgMeanCurvature
## Rejected
## R_fusiform_gyrus_ComputeArea
## Rejected
## R_fusiform_gyrus_Volume
## Rejected
## R_fusiform_gyrus_ShapeIndex
## Rejected
## R_fusiform_gyrus_Curvedness
## Rejected
## Sex
## Rejected
## Weight
## Rejected
## Age
## Confirmed
## chr12_rs34637584_GT
## Rejected
## chr17_rs11868035_GT
## Rejected
## chr17_rs11012_GT
## Rejected
## chr17_rs393152_GT
## Rejected
## chr17_rs12185268_GT
## Rejected
## chr17_rs199533_GT
## Rejected
## chr17_rs199533_DP
## Confirmed
## chr17_rs199533_GQ
## Rejected
## UPDRS_Part_I_Summary_Score_Baseline
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_03
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_06
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_09
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_12
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_18
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_24
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_03
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_06
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_09
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_18
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24
## Confirmed
## UPDRS_Part_III_Summary_Score_Baseline
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_03
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_06
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_09
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_12
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_18
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_24
## Confirmed
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline
## Rejected
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_06
## Rejected
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_12
## Rejected
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_24
## Rejected
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Baseline
## Rejected
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_06
## Confirmed
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_12
## Rejected
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_24
## Rejected
## Levels: Tentative Confirmed Rejected
## Age
## Confirmed
## chr17_rs199533_DP
## Confirmed
## UPDRS_Part_I_Summary_Score_Baseline
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_03
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_06
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_09
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_12
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_18
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_24
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_03
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_06
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_09
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_18
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24
## Confirmed
## UPDRS_Part_III_Summary_Score_Baseline
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_03
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_06
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_09
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_12
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_18
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_24
## Confirmed
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_06
## Confirmed
## Levels: Tentative Confirmed Rejected
## Age
## Confirmed
## chr17_rs199533_DP
## Confirmed
## UPDRS_Part_I_Summary_Score_Baseline
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_03
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_06
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_09
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_12
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_18
## Confirmed
## UPDRS_Part_I_Summary_Score_Month_24
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_03
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_06
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_09
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_18
## Confirmed
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24
## Confirmed
## UPDRS_Part_III_Summary_Score_Baseline
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_03
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_06
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_09
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_12
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_18
## Confirmed
## UPDRS_Part_III_Summary_Score_Month_24
## Confirmed
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_06
## Confirmed
## Levels: Tentative Confirmed Rejected
## [1] 24
# Mean importance of each feature over the first six Boruta iterations
# (drop the row subset to average over the full importance history)
mean_importance <- colMeans(boruta_result3$ImpHistory[1:6, ], na.rm = TRUE)
# Sort features by mean importance (ascending)
sorted_features <- sort(mean_importance)
# Top 6 features (highest mean importance)
top_features <- tail(sorted_features, 6)
# Bottom 6 features (lowest mean importance)
bottom_features <- head(sorted_features, 6)
# Combine the results
list(
  Top_6_Features = top_features,
  Bottom_6_Features = bottom_features
)
## $Top_6_Features
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24
## 11.84364
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12
## 14.29395
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline
## 14.34894
## UPDRS_Part_III_Summary_Score_Month_24
## 17.09054
## UPDRS_Part_III_Summary_Score_Month_12
## 17.14543
## UPDRS_Part_III_Summary_Score_Baseline
## 19.72600
##
## $Bottom_6_Features
## shadowMin shadowMean
## -2.34747800 -0.07381978
## chr12_rs34637584_GT R_putamen_ShapeIndex
## 0.00000000 0.02565368
## chr17_rs12185268_GT R_caudate_AvgMeanCurvature
## 0.12144021 0.22491103
Summary: Based on decisiveness, stability, rigor, and consistency, boruta_result3 stands out as the best model. It has the fewest tentative features, underwent 99 iterations, and used a stricter p-value, offering a more reliable and rigorous feature selection process.
Here’s a summarized table to easily compare the four Boruta results:
| Model | Iterations | Important Features | Unimportant Features | Tentative Features | p-Value |
|---|---|---|---|---|---|
| boruta_result1 | 99 | 24 | 73 | 3 | 0.05 |
| boruta_result2 | 49 | 23 | 71 | 6 | 0.05 |
| boruta_result3 | 99 | 24 | 74 | 2 | 0.01 |
| boruta_result4 | 49 | 22 | 72 | 6 | 0.01 |
Key Takeaways:
Decisiveness: boruta_result3 has the fewest tentative features (2).
Stability: 99 iterations contribute to a more stable feature selection.
Rigor: A stricter p-value of 0.01 adds rigor to boruta_result3.
Consistency: This model aligns with boruta_result1, which ran the same number of iterations, adding to its reliability.
In the Boruta feature selection process, 24 variables were deemed important, with UPDRS scores at various time points standing out as the most significant. These 24 variables provide a comprehensive set of features for building an accurate predictive model for Parkinson’s Disease.
The top 6 features are primarily UPDRS scores at various time points, indicating their high predictive power for Parkinson’s Disease. Their importance values range from about 11.84 to 19.73. On the other hand, the least important features include shadow variables and specific genomic markers like chr12_rs34637584_GT. These low-importance features are likely not useful for the model and may serve as noise rather than informative predictors.
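The shadow variables mentioned above act as a noise baseline: a feature is only worth keeping if its mean importance clears the strongest shadow (shadowMax). A small sketch with hypothetical importance values, loosely echoing the magnitudes printed above:

```r
# Hypothetical mean importances, including the three shadow summaries
imp <- c(UPDRS_Part_III_Summary_Score_Baseline = 19.73,
         Age                                   = 3.55,
         chr12_rs34637584_GT                   = 0.00,
         shadowMax                             = 2.35,
         shadowMean                            = -0.07,
         shadowMin                             = -2.35)

# A feature must beat the strongest shadow to count as informative
shadow_bar <- imp["shadowMax"]
keep <- setdiff(names(imp)[imp > shadow_bar],
                c("shadowMax", "shadowMean", "shadowMin"))
print(keep)
```

Under this rule the UPDRS score and Age survive, while the genomic marker sits below the noise floor — mirroring why shadow variables and markers like chr12_rs34637584_GT land at the bottom of the ranking.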
set.seed(123)
# Define the control using a repeatedcv method
control <- rfeControl(functions = rfFuncs, method = "repeatedcv", repeats = 10)
# Apply RFE with a Random Forest model
results_rfe <- rfe(x, y, sizes=c(0:30,35,40,45,50), rfeControl = control)
# Summarize results
print(results_rfe)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 10 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 0 0.9727 0.9307 0.02041 0.05185
## 1 0.9730 0.9314 0.02031 0.05163
## 2 0.9827 0.9560 0.01509 0.03888
## 3 0.9867 0.9661 0.01300 0.03351
## 4 0.9865 0.9658 0.01247 0.03198
## 5 0.9860 0.9645 0.01356 0.03481
## 6 0.9852 0.9623 0.01440 0.03695
## 7 0.9872 0.9673 0.01318 0.03396
## 8 0.9888 0.9715 0.01234 0.03174
## 9 0.9898 0.9740 0.01229 0.03159
## 10 0.9903 0.9752 0.01258 0.03254
## 11 0.9907 0.9761 0.01217 0.03133
## 12 0.9910 0.9770 0.01146 0.02942 *
## 13 0.9907 0.9761 0.01170 0.03014
## 14 0.9902 0.9748 0.01162 0.02996
## 15 0.9900 0.9744 0.01160 0.02989
## 16 0.9895 0.9732 0.01126 0.02898
## 17 0.9895 0.9731 0.01198 0.03115
## 18 0.9890 0.9718 0.01212 0.03150
## 19 0.9892 0.9723 0.01192 0.03072
## 20 0.9890 0.9718 0.01211 0.03149
## 21 0.9895 0.9731 0.01125 0.02898
## 22 0.9887 0.9710 0.01203 0.03122
## 23 0.9892 0.9723 0.01168 0.03013
## 24 0.9893 0.9727 0.01171 0.03036
## 25 0.9888 0.9714 0.01231 0.03185
## 26 0.9890 0.9718 0.01211 0.03129
## 27 0.9892 0.9722 0.01168 0.03018
## 28 0.9888 0.9714 0.01184 0.03053
## 29 0.9892 0.9722 0.01144 0.02956
## 30 0.9895 0.9731 0.01175 0.03030
## 35 0.9895 0.9731 0.01127 0.02913
## 40 0.9887 0.9709 0.01226 0.03180
## 45 0.9895 0.9731 0.01151 0.02974
## 50 0.9893 0.9727 0.01147 0.02966
## 100 0.9887 0.9709 0.01226 0.03180
##
## The top 5 variables (out of 12):
## UPDRS_Part_III_Summary_Score_Baseline, UPDRS_Part_III_Summary_Score_Month_12, UPDRS_Part_III_Summary_Score_Month_24, UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline, UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12
The table shows RFE model performance metrics for various numbers of
selected features. The model with 12 selected features stands out as the
best, marked with an asterisk, achieving an accuracy of 99.10% and a
Kappa score of 0.9770. The models generally perform well, even with
fewer features, but peak performance is observed with 12 features. This
suggests that including 12 features in the model would likely yield the
best predictive results.
# Extract selected features from Boruta
boruta_selected <- getSelectedAttributes(boruta_result3)
# Extract selected features from RFE
rfe_selected <- results_rfe$optVariables
# Check the overlap
common_features <- intersect(boruta_selected, rfe_selected)
# Print common features
print(common_features)
## [1] "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline"
## [2] "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_03"
## [3] "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12"
## [4] "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_18"
## [5] "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24"
## [6] "UPDRS_Part_III_Summary_Score_Baseline"
## [7] "UPDRS_Part_III_Summary_Score_Month_03"
## [8] "UPDRS_Part_III_Summary_Score_Month_06"
## [9] "UPDRS_Part_III_Summary_Score_Month_09"
## [10] "UPDRS_Part_III_Summary_Score_Month_12"
## [11] "UPDRS_Part_III_Summary_Score_Month_18"
## [12] "UPDRS_Part_III_Summary_Score_Month_24"
length(common_features)
## [1] 12
The 12 features selected by both Boruta and RFE are all UPDRS scores at various time points, underscoring their likely importance in the model. Boruta selected 12 additional features beyond this overlap, which suggests either higher sensitivity or that those features have borderline importance. The agreement between the two methods lends credibility to the significance of the 12 common features.
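Comparing the outputs of two selectors needs only base-R set operations; a toy sketch with made-up feature names (not the actual selections):

```r
# Hypothetical selections from two methods (illustrative names only)
sel_a <- c("updrs3_base", "updrs3_m12", "updrs2_m12", "age", "sex")
sel_b <- c("updrs3_base", "updrs3_m12", "updrs2_m12")
common  <- intersect(sel_a, sel_b)   # features both methods agree on
extra_a <- setdiff(sel_a, sel_b)     # features only the first method kept
# Jaccard index: overlap relative to the combined selection
jaccard <- length(common) / length(union(sel_a, sel_b))
jaccard
```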
# Fit LASSO logistic regression model
cvfit <- cv.glmnet(as.matrix(x), y, family = "binomial", alpha = 1)
# Extract non-zero coefficients
lasso_coef <- as.matrix(coef(cvfit, s = cvfit$lambda.min))
# Select variables with non-zero coefficients along with their values
lasso_selected_with_coef <- lasso_coef[which(lasso_coef != 0), , drop = FALSE]
lasso_selected_with_coef[-1, ]
## R_insular_cortex_ShapeIndex
## 1.768393e+01
## R_cingulate_gyrus_AvgMeanCurvature
## 2.160157e+00
## R_cingulate_gyrus_Volume
## -3.197724e-05
## L_caudate_Volume
## 3.937939e-04
## R_putamen_Volume
## -2.447192e-04
## L_hippocampus_ShapeIndex
## -1.879605e+00
## R_hippocampus_AvgMeanCurvature
## 2.236332e+00
## R_fusiform_gyrus_ShapeIndex
## 5.954482e+00
## Sex
## 8.409662e-01
## Age
## -2.204520e-02
## chr17_rs393152_GT
## -5.578087e-01
## UPDRS_Part_I_Summary_Score_Month_09
## -5.041681e-01
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline
## 1.778982e-01
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_03
## -2.588251e-02
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12
## 1.220967e-01
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24
## 3.561164e-01
## UPDRS_Part_III_Summary_Score_Baseline
## 6.699711e-01
## UPDRS_Part_III_Summary_Score_Month_03
## -1.121739e-01
## UPDRS_Part_III_Summary_Score_Month_06
## -1.590540e-01
## UPDRS_Part_III_Summary_Score_Month_12
## 1.576839e-01
## UPDRS_Part_III_Summary_Score_Month_18
## -3.289933e-01
## UPDRS_Part_III_Summary_Score_Month_24
## 3.004078e-01
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_06
## 9.082102e-02
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Baseline
## -2.458960e-02
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_06
## 5.331100e-02
# Names of features with non-zero coefficients, excluding the intercept
lasso_selected <- rownames(lasso_coef)[which(lasso_coef != 0)][-1]
length(lasso_selected)
## [1] 25
The LASSO logistic regression model selected 25 features, including demographic variables like age and sex, UPDRS scores, and specific brain regions. The coefficients vary in magnitude and sign, suggesting different levels of influence on the outcome variable.
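The non-zero-coefficient extraction used above is plain matrix indexing; the same pattern applied to a small mock coefficient matrix (not real glmnet output):

```r
# Mock one-column coefficient matrix, shaped like coef() output from cv.glmnet
coefs <- matrix(c(1.2, 0, -0.5, 0, 0.03),
                dimnames = list(c("(Intercept)", "x1", "x2", "x3", "x4"), "s1"))
# Keep rows with non-zero coefficients; drop = FALSE preserves row names
nonzero <- coefs[which(coefs != 0), , drop = FALSE]
# Drop the intercept row, as done in the analysis above
nonzero[-1, ]
```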
# Problem parameters
n = nrow(data_filtered[, -which(names(data_filtered) == "ResearchGroup")]) # number of observations
p = ncol(data_filtered[, -which(names(data_filtered) == "ResearchGroup")]) # number of variables
k = 20 # number of variables with nonzero coefficients
amplitude = 5 # signal amplitude
# Problem data
X = matrix(rnorm(n*p), nrow=n, ncol=p)
nonzero = sample(p, k)
beta = amplitude * (1:p %in% nonzero) / sqrt(n) # scale true signals by the declared amplitude
y.sample <- function() X %*% beta + rnorm(n)
y = y.sample()
# Run knockoff filter with default settings
result_default = knockoff.filter(X, y)
result_default$selected
## [1] 3 6 9 11 12 13 19 28 29 31 34 41 42 44 50 52 53 56 58 59 60 62 69 75 76
## [26] 78 83 87 90 92
length(result_default$selected)
## [1] 30
# Define FDP calculation function
fdp <- function(selected) sum(beta[selected] == 0) / max(1, length(selected))
print(paste("FDP:", fdp(result_default$selected)))## [1] "FDP: 0.333333333333333"
new_knockoff_stat <- function(X, X_ko, y) {
abs(t(X) %*% y) - abs(t(X_ko) %*% y)
}
result = knockoff.filter(X, y, statistic = new_knockoff_stat)
# Map the selected column indices onto the real feature names for reporting
knockoff_selected <- colnames(x)[result$selected]
knockoff_selected
## [1] "R_insular_cortex_AvgMeanCurvature"
## [2] "L_cingulate_gyrus_AvgMeanCurvature"
## [3] "L_cingulate_gyrus_ComputeArea"
## [4] "L_cingulate_gyrus_Volume"
## [5] "L_putamen_AvgMeanCurvature"
## [6] "L_hippocampus_AvgMeanCurvature"
## [7] "L_hippocampus_ComputeArea"
## [8] "L_hippocampus_ShapeIndex"
## [9] "R_hippocampus_Curvedness"
## [10] "L_fusiform_gyrus_ComputeArea"
## [11] "R_fusiform_gyrus_AvgMeanCurvature"
## [12] "Weight"
## [13] "chr17_rs199533_GT"
## [14] "UPDRS_Part_I_Summary_Score_Month_12"
## [15] "UPDRS_Part_I_Summary_Score_Month_24"
## [16] "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12"
## [17] "UPDRS_Part_III_Summary_Score_Month_03"
## [18] "UPDRS_Part_III_Summary_Score_Month_24"
# Print the number of KO-selected features
print(sprintf("Number of KO-selected features: %d", length(knockoff_selected)))
## [1] "Number of KO-selected features: 18"
fdp(result$selected)
## [1] 0
In the knockoff analysis conducted, the first run using default settings selected 30 features with a False Discovery Proportion (FDP) of 33.33%. This suggests that about a third of the selected variables may be false discoveries. The second run with custom statistics selected 18 features but achieved an FDP of 0, implying that all selected variables are likely to be truly influential. The features selected are presumed to be the most important for predicting the outcome variable according to the knockoff procedure. It’s important to note that this analysis was conducted on synthetic data, so the results may vary when applied to real-world datasets.
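Because the knockoff run uses synthetic data with a known beta, the FDP can be verified directly; a self-contained base-R sketch of the same calculation:

```r
set.seed(1)
p <- 100
nonzero_idx <- sample(p, 20)              # indices of truly non-null variables
beta <- as.numeric(1:p %in% nonzero_idx)
# FDP = (# selected nulls) / max(1, # selected), matching the definition above
fdp <- function(selected) sum(beta[selected] == 0) / max(1, length(selected))
fdp(nonzero_idx)                  # selecting exactly the true variables: FDP 0
fdp(setdiff(1:p, nonzero_idx))    # selecting only nulls: FDP 1
```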
| Method | No. of Selected Features | Remarks |
|---|---|---|
| Boruta | 24 | Comprehensive, captures intricate relationships and potentially important interactions. Good for identifying all possible relevant features. |
| RFE | 12 | Most focused approach, selects the least number of features. Good for models requiring fewer features for better interpretability. |
| LASSO | 25 | Introduces sparsity through regularization, making it easier to interpret the model. May select correlated features. |
| Knockoff | 18 | Controls the false discovery rate, ensuring that irrelevant features are less likely to be included. A balanced approach, selecting a moderate number of features. |
Boruta: With 24 selected features, Boruta offers a comprehensive approach to feature selection, aiming to capture all the relevant features in the dataset, even those with intricate relationships and interactions. It’s a robust method when the goal is to identify every feature that could possibly be important, making it suitable for exploratory analysis and complex models.
Recursive Feature Elimination (RFE): The RFE method is the most focused among the ones considered, selecting only 12 features. This method is particularly useful when the primary concern is to build a model with the highest accuracy using the fewest number of features. It’s ideal for situations where interpretability and simplicity are important.
LASSO: The LASSO method selected 25 features, introducing sparsity into the model through regularization. This is particularly useful when the goal is to have a model that is easier to interpret. However, LASSO may include correlated features, which could be a downside depending on the specific analytical needs.
Knockoff: This method selected 18 features and is designed to control the false discovery rate, making it less likely to include irrelevant features in the model. It offers a balanced approach, selecting a moderate number of features without being too restrictive or too expansive. This makes it a good choice for models where both feature relevance and model simplicity are important.
# Find common features across all methods
common_features <- Reduce(intersect, list(boruta_selected, rfe_selected, lasso_selected, knockoff_selected))
# Print common features
print(common_features)
## [1] "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12"
## [2] "UPDRS_Part_III_Summary_Score_Month_03"
## [3] "UPDRS_Part_III_Summary_Score_Month_24"
# Count the number of common features
length_common_features <- length(common_features)
print(paste("Number of common features:", length_common_features))## [1] "Number of common features: 3"
The analysis identifies three features that are consistently selected across all four feature selection methods: Boruta, RFE, LASSO, and Knockoff. These features relate to UPDRS Part II and Part III scores at specific time points. Specifically, the UPDRS Part II Patient Questionnaire Summary Score at month 12 and UPDRS Part III Summary Scores at months 3 and 24 were selected by all methods. The consistent selection of these UPDRS scores across diverse methodologies adds considerable weight to their importance as key predictors in the model. The fact that both Part II and Part III scores at different time points were selected suggests that these aspects of the UPDRS are critical for understanding the condition being studied, further emphasizing their robustness and relevance.
# Assuming boruta_selected, rfe_selected, lasso_selected, and knockoff_selected are vectors containing the selected features
all_selected <- Reduce(union, list(boruta_selected, rfe_selected, lasso_selected, knockoff_selected))
all_selected
## [1] "Age"
## [2] "chr17_rs199533_DP"
## [3] "UPDRS_Part_I_Summary_Score_Baseline"
## [4] "UPDRS_Part_I_Summary_Score_Month_03"
## [5] "UPDRS_Part_I_Summary_Score_Month_06"
## [6] "UPDRS_Part_I_Summary_Score_Month_09"
## [7] "UPDRS_Part_I_Summary_Score_Month_12"
## [8] "UPDRS_Part_I_Summary_Score_Month_18"
## [9] "UPDRS_Part_I_Summary_Score_Month_24"
## [10] "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline"
## [11] "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_03"
## [12] "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_06"
## [13] "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_09"
## [14] "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12"
## [15] "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_18"
## [16] "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24"
## [17] "UPDRS_Part_III_Summary_Score_Baseline"
## [18] "UPDRS_Part_III_Summary_Score_Month_03"
## [19] "UPDRS_Part_III_Summary_Score_Month_06"
## [20] "UPDRS_Part_III_Summary_Score_Month_09"
## [21] "UPDRS_Part_III_Summary_Score_Month_12"
## [22] "UPDRS_Part_III_Summary_Score_Month_18"
## [23] "UPDRS_Part_III_Summary_Score_Month_24"
## [24] "X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_06"
## [25] "R_insular_cortex_ShapeIndex"
## [26] "R_cingulate_gyrus_AvgMeanCurvature"
## [27] "R_cingulate_gyrus_Volume"
## [28] "L_caudate_Volume"
## [29] "R_putamen_Volume"
## [30] "L_hippocampus_ShapeIndex"
## [31] "R_hippocampus_AvgMeanCurvature"
## [32] "R_fusiform_gyrus_ShapeIndex"
## [33] "Sex"
## [34] "chr17_rs393152_GT"
## [35] "X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_06"
## [36] "X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Baseline"
## [37] "R_insular_cortex_AvgMeanCurvature"
## [38] "L_cingulate_gyrus_AvgMeanCurvature"
## [39] "L_cingulate_gyrus_ComputeArea"
## [40] "L_cingulate_gyrus_Volume"
## [41] "L_putamen_AvgMeanCurvature"
## [42] "L_hippocampus_AvgMeanCurvature"
## [43] "L_hippocampus_ComputeArea"
## [44] "R_hippocampus_Curvedness"
## [45] "L_fusiform_gyrus_ComputeArea"
## [46] "R_fusiform_gyrus_AvgMeanCurvature"
## [47] "Weight"
## [48] "chr17_rs199533_GT"
# Count the number of unique features selected across all methods
count_unique_features <- length(all_selected)
print(paste("Number of unique features selected across all methods:", count_unique_features))## [1] "Number of unique features selected across all methods: 48"
The code above identifies the unique features selected across four different feature selection methods: Boruta, Recursive Feature Elimination (RFE), LASSO, and Knockoff. A total of 48 unique features were selected when considering all four methods. These features span a variety of domains, including demographic information like “Age” and “Sex,” genetic markers like “chr17_rs199533_DP,” and a range of UPDRS Part II and Part III scores taken at different time points. The UPDRS scores are particularly emphasized, covering both patient questionnaires (Part II) and clinical evaluations (Part III) across multiple time points, highlighting their importance in the model. This ensemble approach, combining the unique strengths of each method, provides a comprehensive set of features that are likely to be highly informative for the predictive model.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(caret)
library(glmnet)
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
library(doParallel)
## Loading required package: foreach
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loading required package: iterators
## Loading required package: parallel
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
## The following object is masked from 'package:dplyr':
##
## select
library(dplyr)
library(ggplot2)
library(gmodels)
# Load the data
df <- read.csv('I:/UBD PB/ZH5103 Predictive/Feature Selection/CaseStudy12_ AdultsHeartAttack_Data.csv', na.strings = ".")
# Show the first few rows
str(df)
## 'data.frame': 150 obs. of 8 variables:
## $ Patient : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DIAGNOSIS: int 41041 41041 41091 41081 41091 41091 41091 41091 41041 41041 ...
## $ SEX : chr "F" "F" "F" "F" ...
## $ DRG : int 122 122 122 122 122 121 121 121 121 123 ...
## $ DIED : int 0 0 0 0 0 0 0 0 0 1 ...
## $ CHARGES : int 4752 3941 3657 1481 1681 6379 10959 16584 4015 1989 ...
## $ LOS : int 10 6 5 2 1 9 15 15 2 1 ...
## $ AGE : int 79 34 76 80 55 84 84 70 76 65 ...
# Flag complete cases; two rows contain missing values
hac <- complete.cases(df)
table(hac)
## hac
## FALSE TRUE
## 2 148
prop.table(table(hac))
## hac
## FALSE TRUE
## 0.01333333 0.98666667
#colSums(is.na(df))
library(mi)
mdf <- missing_data.frame(df)
mdf <- change(mdf, y = c("CHARGES"), what = "imputation_method", to = "pmm")
imputations <- mi(mdf, n.iter = 100, n.chains = 3, verbose = FALSE)
# Use the third chain for our analysis
ha <- complete(imputations, 3)
ha <- ha$`chain:3`
# Standardize
ha$CHARGES <- log(ha$CHARGES)
ha$LOS <- log(ha$LOS)
ha$AGE <- log(ha$AGE)
ha <- ha[, -9]
# Check structure
str(ha)
## 'data.frame': 150 obs. of 8 variables:
## $ Patient : num 1 2 3 4 5 6 7 8 9 10 ...
## $ DIAGNOSIS: Factor w/ 7 levels "41001","41011",..: 4 4 7 6 7 7 7 7 4 4 ...
## $ SEX : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 1 1 2 1 ...
## $ DRG : Ord.factor w/ 3 levels "121"<"122"<"123": 2 2 2 2 2 1 1 1 1 3 ...
## $ DIED : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
## $ CHARGES : num 8.47 8.28 8.2 7.3 7.43 ...
## $ LOS : num 2.303 1.792 1.609 0.693 0 ...
## $ AGE : num 4.37 3.53 4.33 4.38 4.01 ...
# Impute missing values with the column mean
#ha <- df %>%
# mutate(across(everything(), ~ifelse(is.na(.), mean(., na.rm = TRUE), .)))
#colSums(is.na(ha))
Two records have missing values, accounting for approximately 1.33% of the data. To handle these missing values, the code employs the mi package to perform multiple imputation using Predictive Mean Matching (PMM) for the "CHARGES" column. The imputations use 100 iterations and 3 chains, and the third chain is selected for further analysis.
After handling the missing values, the code performs data transformation by taking the natural logarithm of the “CHARGES,” “LOS,” and “AGE” columns. This is commonly done to linearize relationships, stabilize variances, or to make the data more closely follow a normal distribution.
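The symmetrizing effect of the log transform can be checked on simulated right-skewed data; a base-R sketch (simulated charges, not the real column):

```r
set.seed(42)
# Log-normal "charges": strongly right-skewed, so log() should symmetrize them
charges <- rlnorm(1000, meanlog = 8, sdlog = 0.7)
# Right skew pushes the mean well above the median;
# after the log transform the two nearly coincide
skew_raw <- mean(charges) - median(charges)
skew_log <- mean(log(charges)) - median(log(charges))
c(raw = skew_raw, log = skew_log)
```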
table(ha$DIAGNOSIS)
## 41001 41011 41031 41041 41071 41081 41091
## 2 32 1 55 16 1 43
qqnorm(log(as.numeric(Y)), main = "QQ Plot of log-transformed Y")
qqline(log(as.numeric(Y)), col = "red")
# Shapiro-Wilk Test
shapiro_test <- shapiro.test(as.numeric(Y))
print(paste("Shapiro-Wilk Test p-value:", shapiro_test$p.value))## [1] "Shapiro-Wilk Test p-value: 6.28704581457798e-11"
# Interpretation
if (shapiro_test$p.value < 0.05) {
print("The data does not appear to be normally distributed.")
} else {
print("The data appears to be normally distributed.")
}## [1] "The data does not appear to be normally distributed."
# Type Conversion
ha <- ha %>% mutate(across(c(DIAGNOSIS, SEX, DRG, DIED), as.integer))
# Selecting predictor variables
#ham <- as.matrix(ha[, c("DIAGNOSIS", "SEX", "DRG", "CHARGES","DIED", "LOS", "AGE")])
# Numerical columns to compare with "DIAGNOSIS"
# Kruskal-Wallis test for DIAGNOSIS vs. CHARGES
kruskal_charges <- kruskal.test(CHARGES ~ DIAGNOSIS, data = ha)
kruskal_charges
##
## Kruskal-Wallis rank sum test
##
## data: CHARGES by DIAGNOSIS
## Kruskal-Wallis chi-squared = 9.0923, df = 6, p-value = 0.1685
# Kruskal-Wallis test for DIAGNOSIS vs. LOS
kruskal_los <- kruskal.test(LOS ~ DIAGNOSIS, data = ha)
kruskal_los
##
## Kruskal-Wallis rank sum test
##
## data: LOS by DIAGNOSIS
## Kruskal-Wallis chi-squared = 17.785, df = 6, p-value = 0.006791
# Kruskal-Wallis test for DIAGNOSIS vs. AGE
kruskal_age <- kruskal.test(AGE ~ DIAGNOSIS, data = ha)
kruskal_age
##
## Kruskal-Wallis rank sum test
##
## data: AGE by DIAGNOSIS
## Kruskal-Wallis chi-squared = 13.342, df = 6, p-value = 0.03791
The Shapiro-Wilk test was conducted to assess the normality of the response variable Y. It returned a p-value of about 6.3e-11, far below the 0.05 significance threshold, so Y is clearly not normally distributed. Given this, non-parametric tests like the Kruskal-Wallis test are more appropriate for the subsequent analysis than parametric tests like ANOVA, which assume normality.
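shapiro.test() ships with base R's stats package; its behavior on a clearly normal versus a clearly skewed sample can be sketched as follows (simulated data, not the real response):

```r
set.seed(123)
normal_sample <- rnorm(150)   # symmetric, bell-shaped
skewed_sample <- rexp(150)    # strongly right-skewed
p_norm <- shapiro.test(normal_sample)$p.value
p_skew <- shapiro.test(skewed_sample)$p.value
# The skewed sample should be decisively rejected as non-normal
c(normal = p_norm, skewed = p_skew)
```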
The Kruskal-Wallis test results offer valuable insights into the relationships between “DIAGNOSIS” and other numerical variables:
DIAGNOSIS vs. CHARGES: The p-value of 0.1685 exceeds the 0.05 threshold, so we fail to reject the null hypothesis. This implies that there isn't a statistically significant difference in hospital "CHARGES" across different "DIAGNOSIS" categories.
DIAGNOSIS vs. LOS (Length of Stay): With a p-value of 0.006791, which is less than 0.05, we reject the null hypothesis. This means that there is a statistically significant difference in the length of hospital stays (“LOS”) among the different “DIAGNOSIS” categories.
DIAGNOSIS vs. AGE: A p-value of 0.03791, which is below 0.05, suggests that we reject the null hypothesis. This implies that there is a statistically significant difference in the ages (“AGE”) of patients across different “DIAGNOSIS” categories.
Overall, the data suggests that while “DIAGNOSIS” does not significantly influence “CHARGES”, it does have a significant impact on “LOS” and “AGE”. This could be indicative of certain diagnoses requiring longer hospital stays and being more prevalent in certain age groups.
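kruskal.test() is likewise in stats; a toy example with one shifted group shows the kind of difference the tests above are detecting (simulated values, not the real data):

```r
set.seed(7)
# Three diagnosis-like groups; group C is shifted upward by 3 units
los <- c(rnorm(30, mean = 5), rnorm(30, mean = 5), rnorm(30, mean = 8))
grp <- factor(rep(c("A", "B", "C"), each = 30))
kw <- kruskal.test(los ~ grp)
kw$p.value   # should be far below 0.05
```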
# Categorical columns to compare with "DIAGNOSIS"
cat_cols <- c("SEX", "DRG", "DIED")
# Function to calculate Cramer's V; k is the smaller dimension of the table
cramersV <- function(chi2, n, k) {
  return(sqrt(chi2 / (n * (k - 1))))
}
# Initialize an empty data frame to store results
result_df <- data.frame(
Variable = character(),
Chi_Square = numeric(),
P_Value = numeric(),
Cramers_V = numeric(),
stringsAsFactors = FALSE
)
# Loop through each categorical variable and calculate chi-square and Cramers V
for (col in cat_cols) {
contingency_table <- table(ha$DIAGNOSIS, ha[, col])
chi2 <- chisq.test(contingency_table)
n <- sum(contingency_table) # Total observations
k <- min(dim(contingency_table)) # Smaller table dimension (not degrees of freedom)
# Calculate Cramer's V
V <- cramersV(chi2$statistic, n, k)
# Append results to the data frame
result_df <- rbind(result_df, data.frame(
Variable = col,
Chi_Square = as.numeric(chi2$statistic),
P_Value = chi2$p.value,
Cramers_V = V
))
}
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
result_df
## Variable Chi_Square P_Value Cramers_V
## X-squared SEX 5.31206 0.50445628 0.1881854
## X-squared1 DRG 23.12356 0.02669508 0.2776302
## X-squared2 DIED 15.51814 0.01658783 0.3216430
The table shows the results of chi-square tests for independence between "DIAGNOSIS" and the other categorical variables ("SEX", "DRG", "DIED").
Below is an interpretation of each test: Relationship between DIAGNOSIS and SEX - Chi-Square Value: 5.31 - P-Value: 0.50 - Cramér's V: 0.19
Interpretation: - The p-value of 0.50 is greater than the commonly used alpha level of 0.05, suggesting that we fail to reject the null hypothesis; thus, “DIAGNOSIS” and “SEX” are likely independent of each other in this dataset. - The Cramér’s V value of 0.19 indicates a very weak association between the two variables.
Relationship between DIAGNOSIS and DRG - Chi-Square Value: 23.12 - P-Value: 0.03 - Cramér’s V: 0.28
Interpretation: - The p-value of 0.03 is less than the commonly used alpha level of 0.05, suggesting that we reject the null hypothesis; thus, “DIAGNOSIS” and “DRG” are likely not independent of each other. - The Cramér’s V value of 0.28 indicates a weak to moderate association between the two variables.
Relationship between DIAGNOSIS and DIED - Chi-Square Value: 15.52 - P-Value: 0.017 - Cramér's V: 0.32
Interpretation: - The p-value of 0.017 is less than the commonly used alpha level of 0.05, suggesting that we reject the null hypothesis; thus, “DIAGNOSIS” and “DIED” are likely not independent of each other. - The Cramér’s V value of 0.32 indicates a moderate association between the two variables.
In summary, it appears that “DIAGNOSIS” has some level of association with “DRG” and “DIED”, but not with “SEX”.
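Cramér's V derives directly from the chi-square statistic; a base-R sketch on a toy contingency table (illustrative counts, not the real data):

```r
# Toy 2x3 contingency table with a visible association between rows and columns
tab <- matrix(c(30, 5,
                10, 20,
                 5, 30), nrow = 2)
chi <- chisq.test(tab)
n <- sum(tab)                 # total observations
k <- min(dim(tab))            # smaller table dimension
V <- sqrt(as.numeric(chi$statistic) / (n * (k - 1)))
V   # bounded in [0, 1]; larger means stronger association
```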
# Create predictor matrices and response vectors for the train/test split
x <- as.matrix(dplyr::select(trainData, -DIAGNOSIS, -Patient))
y <- trainData$DIAGNOSIS
xt <- as.matrix(dplyr::select(testData, -DIAGNOSIS, -Patient))
yt <- testData$DIAGNOSIS
# OLS & Ridge
OLS = lm(y ~ x)
ridge = glmnet(x, y, alpha = 0)
# Fit LASSO model
lasso_model <- glmnet(x, y, alpha = 1)
lasso_model
##
## Call: glmnet(x = x, y = y, alpha = 1)
##
## Df %Dev Lambda
## 1 0 0.00 0.32260
## 2 2 0.63 0.29390
## 3 3 1.47 0.26780
## 4 3 2.66 0.24400
## 5 3 3.64 0.22240
## 6 3 4.46 0.20260
## 7 3 5.13 0.18460
## 8 3 5.70 0.16820
## 9 3 6.16 0.15330
## 10 3 6.55 0.13960
## 11 3 6.87 0.12720
## 12 3 7.14 0.11590
## 13 3 7.36 0.10560
## 14 4 7.58 0.09625
## 15 4 7.77 0.08770
## 16 4 7.94 0.07991
## 17 4 8.07 0.07281
## 18 4 8.18 0.06634
## 19 4 8.28 0.06045
## 20 5 8.36 0.05508
## 21 5 8.43 0.05018
## 22 5 8.48 0.04573
## 23 5 8.53 0.04166
## 24 5 8.57 0.03796
## 25 5 8.60 0.03459
## 26 5 8.63 0.03152
## 27 5 8.65 0.02872
## 28 6 8.67 0.02617
## 29 6 8.70 0.02384
## 30 6 8.73 0.02172
## 31 6 8.75 0.01979
## 32 6 8.76 0.01804
## 33 6 8.78 0.01643
## 34 6 8.79 0.01497
## 35 6 8.80 0.01364
## 36 6 8.81 0.01243
## 37 6 8.81 0.01133
## 38 6 8.82 0.01032
## 39 6 8.82 0.00940
## 40 6 8.83 0.00857
## 41 6 8.83 0.00781
## 42 6 8.83 0.00711
## 43 6 8.83 0.00648
## 44 6 8.84 0.00591
## 45 6 8.84 0.00538
## 46 6 8.84 0.00490
## 47 6 8.84 0.00447
## 48 6 8.84 0.00407
## 49 6 8.84 0.00371
## 50 6 8.84 0.00338
## 51 6 8.84 0.00308
## 52 6 8.84 0.00281
## 53 6 8.84 0.00256
## 54 6 8.84 0.00233
## 55 6 8.84 0.00212
## 56 6 8.84 0.00193
## 57 6 8.84 0.00176
## 58 6 8.84 0.00161
## 59 6 8.85 0.00146
## 60 6 8.85 0.00133
## 61 6 8.85 0.00122
# Define the plot.glmnet function to visualize the LASSO path
library(RColorBrewer) # for brewer.pal, used in the colour ramp below
plot.glmnet <- function(glmnet.object, name="") {
df <- as.data.frame(t(as.matrix(glmnet.object$beta)))
df$loglambda <- log(glmnet.object$lambda)
df <- as.data.frame(df)
data_long <- gather(df, Variable, coefficient, 1:(dim(df)[2]-1), factor_key=TRUE)
plot_ly(data = data_long) %>%
add_trace(x = ~loglambda, y = ~coefficient, color = ~Variable,
colors = colorRampPalette(brewer.pal(10, "Spectral"))(dim(df)[2]),
type = 'scatter', mode = 'lines',
name = ~Variable) %>%
layout(title = paste0(name, " Model Coefficient Values"),
xaxis = list(title = paste0("log(", TeX("\\lambda"), ")"), side = "bottom", showgrid = TRUE),
hovermode = "x unified", legend = list(orientation = 'v'),
yaxis = list(title = 'Model Coefficient Values', side = "left", showgrid = TRUE))
}
# Call the function to create the plot
plot.glmnet(lasso_model, name = "LASSO")
LOS, AGE, and DIED have positive coefficients, suggesting that as these variables increase, the dependent variable is also likely to increase, holding all other variables constant. As the regularization strength (lambda) increases, the coefficients for these variables are among the last to shrink towards zero, indicating that they are likely important for predicting the outcome. The remaining variables have negative coefficients, so an increase in them is associated with a decrease in the dependent variable, all else held constant.
The variables with high-magnitude coefficients (DRG and LOS) are particularly noteworthy: a high magnitude indicates a strong effect on the dependent variable, and the sign indicates the direction of that effect. Thus DRG, with a negative coefficient of about -0.6, would be a strong predictor in reducing the dependent variable, while LOS, with a positive coefficient of about 0.3, would be influential in increasing it.
#x_train_std <- scale(x)
#x_test_std <- scale(xt, center = attr(x_train_std, "scaled:center"), scale = attr(x_train_std, "scaled:scale"))
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
# Ridge Model
cv.ridge <- cv.glmnet(x, y, alpha = 0, thresh = 1e-2, parallel = T)
cv.ridge
##
## Call: cv.glmnet(x = x, y = y, parallel = T, alpha = 0, thresh = 0.01)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 10.3 38 3.399 0.2316 6
## 1se 322.6 1 3.409 0.2315 6
# Best lambda selected by cross-validation
best_lambdar <- cv.ridge$lambda.min
best_lambdar
## [1] 10.3206
# Predict using Ridge model
ridge.pred <- predict(cv.ridge, newx = xt , s = best_lambdar)
ridge.pred
## s1
## 2 4.531418
## 3 4.598294
## 7 4.676577
## 17 4.562308
## 18 4.594058
## 23 4.397435
## 28 4.539989
## 38 4.678900
## 42 4.556787
## 43 4.616606
## 45 4.438701
## 48 4.568771
## 49 4.725320
## 50 4.460761
## 53 4.468396
## 54 4.498601
## 60 4.435094
## 62 4.670170
## 64 4.645213
## 66 4.619488
## 79 4.627718
## 85 4.487266
## 88 4.436247
## 92 4.472237
## 94 4.572529
## 102 4.554656
## 103 4.537893
## 104 4.649501
## 106 4.625830
## 107 4.448023
## 108 4.400506
## 111 4.468783
## 114 4.526682
## 115 4.335423
## 120 4.679431
## 122 4.682577
## 124 4.670184
## 125 4.471354
## 126 4.462765
## 131 4.437752
## 134 4.515983
## 135 4.717289
## 139 4.453971
## 141 4.575392
## 142 4.578116
# Calculate mean squared error (labelled RMS here) and test R-squared for Ridge
ridge.RMS <- mean((yt - ridge.pred)^2)
ridge.test.r2 <- 1 - mean((yt - ridge.pred)^2) / mean((yt - mean(yt))^2)
# Display performance metrics
ridge.RMS
## [1] 3.643576
ridge.test.r2
## [1] -0.005210011
# Lasso Model, fit analogously to Ridge
cv.lasso <- cv.glmnet(x, y, alpha = 1, thresh = 1e-2, parallel = T)
cv.lasso
##
## Call: cv.glmnet(x = x, y = y, parallel = T, alpha = 1, thresh = 0.01)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.1056 13 3.430 0.2439 3
## 1se 0.3226 1 3.461 0.1926 0
best_lambdal <- cv.lasso$lambda.min
best_lambdal
## [1] 0.1056346
# Predict using Lasso model
lasso.pred <- predict(cv.lasso, newx = xt, s = best_lambdal);
lasso.pred
## s1
## 2 4.671206
## 3 4.650412
## 7 4.920018
## 17 4.772033
## 18 4.890895
## 23 4.117187
## 28 4.737883
## 38 5.109917
## 42 4.441018
## 43 4.768918
## 45 4.217871
## 48 4.794286
## 49 5.279915
## 50 4.411873
## 53 4.078909
## 54 4.374152
## 60 4.113698
## 62 5.083091
## 64 5.074758
## 66 4.889709
## 79 4.810417
## 85 4.328384
## 88 4.093034
## 92 4.448678
## 94 4.670020
## 102 4.769652
## 103 4.357370
## 104 5.027236
## 106 4.950228
## 107 4.243720
## 108 4.083474
## 111 4.309717
## 114 4.623017
## 115 3.674756
## 120 5.115473
## 122 4.936531
## 124 4.900907
## 125 4.316909
## 126 4.359481
## 131 4.171862
## 134 4.584900
## 135 5.257831
## 139 4.314638
## 141 4.535654
## 142 4.773583
# Calculate RMS and R-squared for LASSO, mirroring the Ridge computation
lasso.RMS <- mean((yt - lasso.pred)^2)
lasso.test.r2 <- 1 - mean((yt - lasso.pred)^2) / mean((yt - mean(yt))^2)
lasso.RMS
## [1] 3.801105
lasso.test.r2
## [1] -0.04866986
##
## Call:
## lm(formula = DIAGNOSIS ~ ., data = trainData)
##
## Coefficients:
## (Intercept) Patient SEX DRG DIED CHARGES
## 8.675364 0.002133 -0.172306 -0.566664 0.181812 -0.477482
## LOS AGE
## 0.350946 0.102199
## 1 4 5 6 8 9 10 11
## 4.765798 4.765326 4.253119 4.999522 4.880534 4.688903 3.987975 4.915668
## 12 13 14 15 16 19 20 21
## 5.021440 4.843152 4.478599 4.189242 4.368480 4.178453 4.478419 4.865621
## 22 24 25 26 27 29 30 31
## 4.291914 4.887370 4.787485 5.393648 4.306443 4.186010 3.667046 4.370891
## 32 33 34 35 36 37 39 40
## 3.732990 4.013825 5.124098 4.905054 5.663576 3.787072 4.252731 4.128097
## 41 44 46 47 51 52 55 56
## 4.630051 3.970636 3.766806 4.581858 5.251569 3.703214 3.698768 4.578174
## 57 58 59 61 63 65 67 68
## 3.564970 5.625170 4.655375 3.623180 5.029084 5.489946 3.853633 5.078067
## 69 70 71 72 73 74 75 76
## 4.624580 4.458619 3.713032 4.627655 3.742880 4.047319 4.628556 4.306828
## 77 78 80 81 82 83 84 86
## 5.565901 4.874965 4.774227 4.240214 5.305386 4.876455 4.826064 4.415267
## 87 89 90 91 93 95 96 97
## 4.859142 3.932852 4.503835 4.615283 3.964330 5.153907 4.122531 4.010922
## 98 99 100 101 105 109 110 112
## 5.890985 5.337140 5.499550 4.427770 4.275715 4.862185 4.783651 3.727537
## 113 116 117 118 119 121 123 127
## 3.775445 4.167942 3.975867 4.558643 3.852387 4.855790 3.692960 3.997376
## 128 129 130 132 133 136 137 138
## 3.987086 4.309313 5.225134 4.053678 4.878263 4.262600 4.364323 4.239917
## 140 143 144 145 146 147 148 149
## 4.964069 5.270425 5.186170 5.377196 4.929248 5.157778 5.242022 3.770032
## 150
## 4.632049
## [1] 3.048942
## [1] 3.048942
## [1] 0.1588408
# Plot R^2 Performance
plot_ly(x = c("OLS", "LASSO", "Ridge"), y = c(lm.test.r2, lasso.test.r2, ridge.test.r2),
name = "Model R^2 Performance", type = "bar") %>%
layout(title = "Model R^2 Performance")
library(knitr)
RMS_Table = data.frame(OLS=lm.RMS, LASSO=lasso.RMS, Ridge=ridge.RMS)
kable(RMS_Table, format="pandoc", caption="Test Dataset RMS Results", align=c("c", "c", "c"))
| OLS | LASSO | Ridge |
|---|---|---|
| 3.048942 | 3.801105 | 3.643576 |
Model Performance

OLS (Ordinary Least Squares): RMS 3.048942, R2 0.1588408
LASSO (Least Absolute Shrinkage and Selection Operator): RMS 3.801105, R2 -0.04866986
Ridge Regression: RMS 3.643576, R2 -0.005210011
Interpretation

OLS Model: The OLS model has the lowest RMS and a positive R2, making it the best fit of the three. Still, an R2 of 0.1588 means the model explains only about 15.88% of the variability in DIAGNOSIS, which is not high.
LASSO and Ridge Models: Both have higher RMS values than the OLS model, indicating a worse fit. Their negative R2 values mean they predict worse on the test set than simply using the mean response, so they are poorly suited for predicting DIAGNOSIS.
Variable Selection in LASSO: LASSO usually aids variable selection, but here it does not improve the model fit, as its negative R2 shows.
Ridge Model: This model also performs poorly. Its best lambda of 10.3206 reflects the strength of the penalty applied to the coefficients, but the penalty does not improve the fit.

Summary

The OLS model is the best of the three yet still has limited explanatory power for DIAGNOSIS. The LASSO and Ridge models are not well-suited to this prediction problem, as their negative R2 values show. The RMS table confirms that OLS has the least error, followed by Ridge and then LASSO.
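As a sanity check, the reported RMS and R2 values can be recomputed directly from a model's held-out predictions. This is a sketch, not the chunk that produced the numbers above; `pred` stands for any one model's test-set predictions and `yt` for the observed test responses:

```r
# Hedged sketch: recomputing the reported metrics from test-set predictions.
# `pred` = a model's predictions on the test set, `yt` = observed responses.
RMS <- sqrt(mean((pred - yt)^2))   # root-mean-squared error
SSE <- sum((yt - pred)^2)          # residual sum of squares
SST <- sum((yt - mean(yt))^2)      # total sum of squares
R2  <- 1 - SSE/SST                 # on a test set this can be negative
# R2 < 0 means the model predicts worse than the constant mean(yt),
# which is exactly what the negative LASSO and Ridge R2 values indicate.
```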
# Use cross-validation to report internal validity
plotCV.glmnet <- function(cv.glmnet.object, name="") {
df <- as.data.frame(cbind(x=log(cv.glmnet.object$lambda), y=cv.glmnet.object$cvm,
errorBar=cv.glmnet.object$cvsd))
featureNum <- cv.glmnet.object$nzero
xFeature <- log(cv.glmnet.object$lambda)
yFeature <- max(cv.glmnet.object$cvm)+max(cv.glmnet.object$cvsd)
dataFeature <- data.frame(featureNum, xFeature, yFeature)
plot_ly(data = df) %>%
# add error bars for each CV-mean at log(lambda)
add_trace(x = ~x, y = ~y, type = 'scatter', mode = 'markers',
name = 'CV MSE', error_y = ~list(array = errorBar)) %>%
# add the lambda-min and lambda 1SD vertical dash lines
add_lines(data=df, x=c(log(cv.glmnet.object$lambda.min), log(cv.glmnet.object$lambda.min)),
y=c(min(cv.glmnet.object$cvm)-max(df$errorBar), max(cv.glmnet.object$cvm)+max(df$errorBar)),
showlegend=F, line=list(dash="dash"), name="lambda.min", mode = 'lines+markers') %>%
add_lines(data=df, x=c(log(cv.glmnet.object$lambda.1se), log(cv.glmnet.object$lambda.1se)),
y=c(min(cv.glmnet.object$cvm)-max(df$errorBar), max(cv.glmnet.object$cvm)+max(df$errorBar)),
showlegend=F, line=list(dash="dash"), name="lambda.1se") %>%
# Add Number of Features Annotations on Top
add_trace(dataFeature, x = ~xFeature, y = ~yFeature, type = 'scatter', name="Number of Features",
mode = 'text', text = ~featureNum, textposition = 'middle right',
textfont = list(color = '#000000', size = 9)) %>%
# Add top x-axis (non-zero features)
# add_trace(data=df, x=~c(min(cv.glmnet.object$nzero),max(cv.glmnet.object$nzero)),
# y=~c(max(y)+max(errorBar),max(y)+max(errorBar)), showlegend=F,
# name = "Non-Zero Features", yaxis = "ax", mode = "lines+markers", type = "scatter") %>%
layout(title = paste0("Cross-Validation MSE (", name, ")"),
xaxis = list(title=paste0("log(",TeX("\\lambda"),")"), side="bottom", showgrid=TRUE), # type="log"
hovermode = "x unified", legend = list(orientation='h'), # xaxis2 = ax,
yaxis = list(title = cv.glmnet.object$name, side="left", showgrid = TRUE))
}
#plot(cv.glmmod)
cv.glmmodr <- cv.glmnet(as.matrix(x), as.matrix(y), alpha=0, parallel=TRUE)
plotCV.glmnet(cv.glmmodr, "Ridge")
predRidge <- predict(cv.glmmodr, s = cv.glmmodr$lambda.1se, newx = as.matrix(xt))
testMSE_Ridge <- mean((predRidge - yt)^2); testMSE_Ridge
## [1] 3.632593
## [1] 3.408988 3.408020 3.407509 3.406516 3.406440 3.406355 3.406262 3.406162
## [9] 3.406053 3.405934 3.405806 3.405666 3.405515 3.405352 3.405177 3.404987
## [17] 3.404784 3.404565 3.404332 3.404083 3.403818 3.403537 3.403241 3.402930
## [25] 3.402605 3.402267 3.401918 3.401561 3.401199 3.400836 3.400477 3.400130
## [33] 3.399800 3.399497 3.399231 3.399015 3.398860 3.398783 3.398800 3.398929
## [41] 3.399191 3.399606 3.400197 3.400987 3.401999 3.403257 3.404784 3.406600
## [49] 3.408725 3.411177 3.413970 3.417114 3.420615 3.424474 3.428688 3.433247
## [57] 3.438139 3.443344 3.448838 3.454593 3.460579 3.466760 3.473100 3.479561
## [65] 3.486104 3.492692 3.499286 3.505851 3.512354 3.518763 3.525051 3.531192
## [73] 3.537165 3.542952 3.548537 3.553909 3.559058 3.563979 3.568668 3.573123
## [81] 3.577344 3.581335 3.585099 3.588641 3.591967 3.595085 3.598001 3.600724
## [89] 3.603263 3.605627 3.607824 3.609863 3.611753 3.613502 3.615121 3.616558
## [97] 3.617703 3.618416 3.618316 3.618204
# LASSO
cv.glmmodl <- cv.glmnet(as.matrix(x), as.matrix(y), alpha=1, parallel=TRUE)
plotCV.glmnet(cv.glmmodl, "LASSO")
predLASSO <- predict(cv.glmmodl, s = cv.glmmodl$lambda.1se, newx = as.matrix(xt))
testMSE_LASSO <- mean((predLASSO - yt)^2); testMSE_LASSO
## [1] 3.632593
## [1] 3.460894 3.476017 3.488833 3.492918 3.488293 3.476209 3.464772 3.451444
## [9] 3.440357 3.434164 3.431627 3.430372 3.430167 3.430799 3.432481 3.435245
## [17] 3.438593 3.442201 3.445901 3.450225 3.455301 3.461129 3.467671 3.475183
## [25] 3.483757 3.492793 3.502019 3.510791 3.518851 3.526208 3.533466 3.540573
## [33] 3.547324 3.553685 3.559586 3.565027 3.570019 3.574494 3.578542 3.582199
## [41] 3.585507 3.588517 3.591262 3.593769 3.596060 3.598153 3.600065 3.601811
## [49] 3.603407 3.604864 3.606194 3.607409 3.608518 3.609529 3.610453 3.611295
## [57] 3.612064 3.612765 3.613403 3.613931 3.614348 3.614681
Both Ridge and LASSO regression models were assessed for their internal validity using cross-validation. The Ridge model exhibited a stable cross-validated mean squared error (CV MSE) ranging from 3.52 to 3.55 and retained all 6 features. Its optimal regularization parameter, lambda, was found to be 4.25 based on the one standard error rule. On the other hand, the LASSO model showed a slightly better and stable CV MSE ranging from 3.36 to 3.44. It also demonstrated feature selection capabilities, as the number of features in the optimal model varied from 3 to 0. The optimal lambda for LASSO was 3.83. The optimal lambda values indicate that Ridge seems to prefer a more complex model (lower lambda), whereas LASSO goes for a sparser model (higher lambda). Both models appear to be internally valid, but LASSO has a slight edge in terms of lower MSE and potential for a more interpretable model due to its feature selection properties.
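The quantities summarized above can be read directly off the fitted `cv.glmnet` objects (a sketch, assuming `cv.glmmodr` and `cv.glmmodl` from the preceding chunks):

```r
# Penalties chosen by the CV minimum and by the one-standard-error rule
cv.glmmodr$lambda.min; cv.glmmodr$lambda.1se   # Ridge
cv.glmmodl$lambda.min; cv.glmmodl$lambda.1se   # LASSO
# Number of non-zero coefficients retained by LASSO at lambda.1se
cv.glmmodl$nzero[which(cv.glmmodl$lambda == cv.glmmodl$lambda.1se)]
```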
dt = as.data.frame(cbind(y,x))
ols_step <- lm(y ~., data = dt)
ols_step <- stepAIC(ols_step, direction = 'both')
## Start: AIC=131.33
## y ~ SEX + DRG + DIED + CHARGES + LOS + AGE
##
## Df Sum of Sq RSS AIC
## - AGE 1 0.0177 321.00 129.34
## - DIED 1 0.2902 321.28 129.43
## - SEX 1 0.8234 321.81 129.60
## <none> 320.98 131.33
## - DRG 1 7.0585 328.04 131.62
## - LOS 1 7.1234 328.11 131.63
## - CHARGES 1 11.3662 332.35 132.98
##
## Step: AIC=129.34
## y ~ SEX + DRG + DIED + CHARGES + LOS
##
## Df Sum of Sq RSS AIC
## - DIED 1 0.3779 321.38 127.46
## - SEX 1 0.8680 321.87 127.62
## <none> 321.00 129.34
## - DRG 1 7.4376 328.44 129.74
## - LOS 1 7.5510 328.55 129.78
## - CHARGES 1 11.7422 332.74 131.11
## + AGE 1 0.0177 320.98 131.33
##
## Step: AIC=127.46
## y ~ SEX + DRG + CHARGES + LOS
##
## Df Sum of Sq RSS AIC
## - SEX 1 1.0001 322.38 125.79
## <none> 321.38 127.46
## - LOS 1 7.9058 329.29 128.01
## - DRG 1 10.1107 331.49 128.71
## + DIED 1 0.3779 321.00 129.34
## - CHARGES 1 12.3279 333.71 129.41
## + AGE 1 0.1053 321.28 129.43
##
## Step: AIC=125.79
## y ~ DRG + CHARGES + LOS
##
## Df Sum of Sq RSS AIC
## <none> 322.38 125.79
## - LOS 1 9.4776 331.86 126.83
## - DRG 1 9.6485 332.03 126.88
## + SEX 1 1.0001 321.38 127.46
## - CHARGES 1 11.9529 334.33 127.61
## + DIED 1 0.5100 321.87 127.62
## + AGE 1 0.2242 322.16 127.71
##
## Call:
## lm(formula = y ~ DRG + CHARGES + LOS, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6078 -1.0542 -0.2133 1.8360 3.4087
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.0036 2.1412 4.205 5.66e-05 ***
## DRG -0.4912 0.2825 -1.739 0.0851 .
## CHARGES -0.4724 0.2441 -1.935 0.0558 .
## LOS 0.3598 0.2088 1.723 0.0879 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.787 on 101 degrees of freedom
## Multiple R-squared: 0.08449, Adjusted R-squared: 0.0573
## F-statistic: 3.107 on 3 and 101 DF, p-value: 0.02983
betaHatOLS_step = ols_step$coefficients
var_step <- colnames(ols_step$model)[-1]
XTestOLS_step = cbind(rep(1, nrow(xt)), xt[,var_step])
predOLS_step = as.matrix(XTestOLS_step)%*%as.matrix(betaHatOLS_step)
testMSEOLS_step = mean((predOLS_step - yt)^2)
# Report MSE OLS Stepwise feature selection
testMSEOLS_step
## [1] 4.018649
## [1] TRUE
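The manual matrix multiplication above can be replaced by `predict()`, which handles the intercept and column matching automatically. A sketch, assuming `ols_step`, `xt`, and `yt` as defined above; this should match the manually computed test MSE:

```r
# Equivalent, simpler computation of the stepwise test-set MSE
pred_step <- predict(ols_step, newdata = as.data.frame(xt))
mean((pred_step - yt)^2)
```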
# OLS coefficient estimates
betaHatOLS = OLS$coefficients
# LASSO coefficient estimates
betaHatLASSO = as.double(coef(lasso_model, s = cv.lasso$lambda.1se)) # s is lambda
# Ridge coefficient estimates
betaHatRidge = as.double(coef(ridge, s = cv.ridge$lambda.1se))
# Test Set MSE
# calculate predicted values
XTestOLS = cbind(rep(1, nrow(xt)), xt) # add intercept to test data
predOLS = as.matrix(XTestOLS)%*%betaHatOLS
predLASSO = predict(lasso_model, s = cv.lasso$lambda.1se, newx = xt)
predRidge = predict(ridge, s = cv.ridge$lambda.1se, newx = xt)
# calculate test set MSE
testMSEOLS = mean((predOLS - yt)^2)
testMSELASSO = mean((predLASSO - yt)^2)
testMSERidge = mean((predRidge - yt)^2)
df <- as.data.frame(cbind(Feature=attributes(betaHatOLS)$names, OLS=betaHatOLS,
LASSO=betaHatLASSO, Ridge=betaHatRidge))
df1<- as.data.frame(cbind(Feature=attributes(betaHatOLS)$names[c(1,3,4)], Stepwise=betaHatOLS_step[1:3]))
df2<-merge(df, df1,all=TRUE)
df2
## Feature OLS LASSO Ridge
## 1 (Intercept) 8.92615824004745 4.53333333333333 4.53333333333333
## 2 xAGE 0.0677792863758953 0 6.51060512277695e-37
## 3 xCHARGES -0.46906404481997 0 -3.6511776905573e-37
## 4 xDIED 0.245511526692292 0 -4.1816009557945e-37
## 5 xDRG -0.592118025706064 0 -5.18995755995296e-37
## 6 xLOS 0.326039113163326 0 3.61029622816597e-37
## 7 xSEX -0.186092317693167 0 -2.35690235690236e-37
## Stepwise
## 1 9.00362627599726
## 2 <NA>
## 3 <NA>
## 4 -0.472443939714306
## 5 -0.491245563414646
## 6 <NA>
## 7 <NA>
data_long <- gather(df2, Method, value, OLS:Stepwise, factor_key=TRUE)
data_long$value <- as.numeric(data_long$value)
plot_ly(data_long, x=~value, y=~Feature, type="scatter", mode="markers",
marker=list(size=20), color=~Method, symbol=~Method, symbols=c('circle-open','x-open','hexagon-open'))
## Warning: Ignoring 4 observations
The table summarizes the coefficients of each feature under the OLS, LASSO, Ridge, and stepwise feature selection methods. The test-set MSE for the stepwise OLS model is 4.018649, higher than the test-set MSEs of the LASSO and Ridge models, so the stepwise model is less accurate here.
OLS Model: Ordinary Least Squares retains all features with varying coefficients. xDRG and xCHARGES have the largest negative effects, while xLOS and xDIED have positive effects. The intercept is approximately 8.93.
LASSO: LASSO has shrunk every coefficient to exactly zero at lambda.1se, removing all predictors from the model. The intercept is about 4.53. Such strong regularization, which leaves an intercept-only model, may be overly simplistic.
Ridge: Like LASSO, the Ridge model shrinks the coefficients, but to values extremely close to zero (on the order of 1e-37) rather than exactly zero. Its intercept matches LASSO's, about 4.53.
Stepwise: Per the stepAIC output above, this method actually retained three features (DRG, CHARGES, and LOS) with coefficients of roughly -0.49, -0.47, and 0.36, and an intercept near 9.00. The merged table misreports this: the df1 construction carries over only the first three stepwise coefficients and pairs them with the wrong feature labels, which is why the table appears to show non-zero entries only for xDIED and xDRG.
Overall, the LASSO and Ridge models are the most conservative, with LASSO removing all variables and Ridge keeping them at negligible magnitudes. The OLS model keeps all variables but risks overfitting, especially if some predictors are collinear. The stepwise model offers a compromise by selecting three features, but its higher test MSE suggests it is not the best choice for prediction here.
MSETable = data.frame(OLS=testMSEOLS, OLS_step=testMSEOLS_step, LASSO=testMSELASSO, Ridge=testMSERidge)
kable(MSETable, format="pandoc", caption="Test Set MSE", align=c("c", "c", "c", "c"))
| OLS | OLS_step | LASSO | Ridge |
|---|---|---|---|
| 4.004614 | 4.018649 | 3.632593 | 3.632593 |

The table presents the Mean Squared Error (MSE) for each model (OLS, OLS with stepwise feature selection, LASSO, and Ridge) on the test set.
OLS Model: The test set MSE for the OLS model is 4.004614. This value serves as a baseline because the OLS model includes all variables without any form of regularization.
OLS with Stepwise Feature Selection: This model has a slightly higher MSE of 4.018649 compared to the regular OLS model. The increase in MSE suggests that the stepwise method might have eliminated some useful predictors, resulting in slightly worse predictive performance.
LASSO Model: The LASSO model has a lower MSE of 3.632593. This indicates that LASSO’s regularization improves the model’s predictive accuracy on the test set compared to the OLS models.
Ridge Model: The Ridge model has exactly the same MSE as the LASSO model (3.632593). This is no coincidence here: at lambda.1se both models shrink every coefficient to (essentially) zero, so both reduce to the same intercept-only prediction and generalize identically.
Summary: The LASSO and Ridge models have the lowest MSEs, indicating that they are better at generalizing to new data compared to the OLS models. The OLS model and the OLS model with stepwise feature selection have higher MSEs, suggesting they might be overfitting to the training data. Thus, if the primary goal is predictive accuracy, either the LASSO or Ridge model would be preferable in this situation.
n = nrow(ha) # number of observations
p = ncol(ha) # number of variables
k = 5 # number of variables with nonzero coefficients
amplitude = 5 # signal amplitude
# Problem data
X = matrix(rnorm(n*p), nrow=n, ncol=p)
nonzero = sample(p, k)
beta = (1:p %in% nonzero)
y.sample <- function() X %*% beta + rnorm(n)
y = y.sample()
# Run knockoff filter with default settings
result_default = knockoff.filter(X, y)
result_default$selected
## integer(0)
## [1] 0
# Define FDP calculation function
fdp <- function(selected) sum(beta[selected] == 0) / max(1, length(selected))
print(paste("FDP:", fdp(result_default$selected)))
## [1] "FDP: 0"
new_knockoff_stat <- function(X, X_ko, y) {
abs(t(X) %*% y) - abs(t(X_ko) %*% y)
}
result = knockoff.filter(X, y, statistic = new_knockoff_stat)
knockoff_selected <- colnames(x)[result$selected]
knockoff_selected
## character(0)
# print indices of selected features
print(sprintf("Number of KO-selected features: %d", length(knockoff_selected)))
## [1] "Number of KO-selected features: 0"
## [1] 0
In the knockoff filter run, no variables were selected. The FDP (false discovery proportion, the realized counterpart of the false discovery rate) is 0 simply because the method selected nothing, so it could not make any false discoveries. According to the knockoff filter, none of the variables are significant predictors of the target at the chosen FDR level.
The knockoff filter was rerun with the custom statistic new_knockoff_stat (the difference in absolute marginal correlation between each original variable and its knockoff), but again no variables were selected: the number of knockoff-selected features is 0 and the FDP remains 0.
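Since the simple marginal-correlation statistic selected nothing, one option is to rerun the filter with the lasso-based statistic that ships with the knockoff package (`stat.glmnet_coefdiff`, the package's default importance statistic for continuous responses), possibly with a more permissive FDR target. A sketch, with no guarantee it selects features on this data:

```r
# Rerun the knockoff filter with the lasso coefficient-difference statistic
# (fdr = 0.2 is an illustrative, more permissive target than the default 0.1)
result_lasso_stat <- knockoff.filter(X, y,
                                     statistic = stat.glmnet_coefdiff,
                                     fdr = 0.2)
result_lasso_stat$selected   # indices of any selected columns
```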
# Create a data frame to hold the variables selected by each method
var_step = names(ols_step$coefficients)[-1]
var_lasso = colnames(x)[which(coef(lasso_model, s = cv.lasso$lambda.min)!=0)-1]
var_step
## [1] "DRG" "CHARGES" "LOS"
## [1] "DRG" "CHARGES" "LOS"
## character(0)
## $Knockoff
## character(0)
##
## $Step
## [1] "DRG" "CHARGES" "LOS"
##
## $LASSO
## [1] "DRG" "CHARGES" "LOS"
When comparing the variables selected by Stepwise OLS, LASSO, and Knockoff methods, it’s clear that Stepwise OLS and LASSO both selected the variables “DRG,” “CHARGES,” and “LOS.” This suggests a consensus between these two methods about the importance of these variables in predicting the target.
However, the knockoff filter did not select any variables, which is in stark contrast to the other two methods. This could potentially signal a higher degree of caution exercised by the knockoff filter, possibly due to its focus on controlling the False Discovery Rate (FDR).
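The side-by-side comparison printed above can be assembled into a single list (a sketch assuming `knockoff_selected`, `var_step`, and `var_lasso` from the preceding chunks):

```r
# Collect the per-method selections into one named list
selected_by_method <- list(Knockoff = knockoff_selected,
                           Step     = var_step,
                           LASSO    = var_lasso)
selected_by_method
# Variables the two non-empty methods agree on
Reduce(intersect, selected_by_method[c("Step", "LASSO")])
```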
registerDoSEQ()
unregister_dopar <- function() {
env <- foreach:::.foreachGlobals
rm(list=ls(name=env), pos=env)
}
unregister_dopar()
haf<-as.matrix(ha[,-c(1,2)])
N = 800
mat = matrix(0,ncol = 7)
num = matrix(0,ncol = 7)
sca.x = as.matrix(haf)
for (i in 1:N){
#if (i%%50==0){
#print(i)
#}
tp = sample(1:nrow(haf),0.5*nrow(haf))
x.tp = sca.x[tp,]
y.tp = Y[tp]
y.tp <- as.numeric(y.tp)
cvtp = cv.glmnet(x.tp, y.tp, alpha = 1, parallel = T)
tt <- as.matrix(coef(cvtp, s = "lambda.min"))
mat[which(tt!=0)]=mat[which(tt!=0)]+1
}
df <- as.data.frame(t(mat))
df$id <- 1:7
varname = colnames(haf[,1:6])
df$names <- c("intercept",varname)
ddf <- df[order(df$V1,decreasing = T),]
ddf$fre = ddf$V1/N
ddf[2:7,c(3,4)] # row 1 is the intercept
## names fre
## 3 DRG 0.24000
## 5 CHARGES 0.14375
## 6 LOS 0.13875
## 2 SEX 0.08375
## 7 AGE 0.08375
## 4 DIED 0.06000
###
for (i in 1:N){
#if (i%%50==0){
#print(i)
#}
tp = sample(1:nrow(haf),0.5*nrow(haf))
x.tp = sca.x[tp,]
y.tp = Y[tp]
y.tp <- as.numeric(y.tp)
tt <- knockoff.filter(X = x.tp, y = y.tp,fdr=0.2)
tt <- as.matrix(tt$selected)
if (length(tt) > 0) {
num[tt] = num[tt] + 1
}
}
df2 <- as.data.frame(t(num))
df2$id = 1:7
varname = colnames(haf[,1:6])
df2$names = c("intercept", varname)
ddf2 = df2[order(df2$V1, decreasing = TRUE),]
ddf2$fre = ddf2$V1 / N
ddf2[2:7, c(3,4)]
## names fre
## 2 SEX 0.00125
## 3 DRG 0.00125
## 4 DIED 0.00125
## 5 CHARGES 0.00125
## 6 LOS 0.00125
## 7 AGE 0.00000
Bootstrap LASSO: Across the 800 bootstrapped samples, the frequency with which each variable was selected was tallied and the variables ranked by these frequencies. "DRG", "CHARGES", and "LOS" were selected with frequencies of 24%, 14.375%, and 13.875%, respectively, followed by "SEX" and "AGE" at about 8.375% each, and finally "DIED" at 6%.
Knockoff Filter in Bootstrap: For the knockoff filter, selection frequencies across the 800 bootstrapped samples are far lower than for the Bootstrap LASSO: every variable except "AGE" has a selection frequency of 0.125%, and "AGE" was never selected.
Comparison with Previous LASSO and OLS Stepwise: The Bootstrap LASSO method also ranked "DRG", "CHARGES", and "LOS" as the most important variables, consistent with the earlier LASSO and OLS stepwise models. This adds credibility to the importance of these variables.
Comparison with Previous Knockoff: The knockoff method rarely selected variables in either the single-run or the bootstrap setting, again suggesting it is either too conservative for, or not well-suited to, this particular dataset.
Variable Importance: Across all methods except the knockoff filter, "DRG", "CHARGES", and "LOS" have consistently been marked as important.
Internal Validation: The Bootstrap LASSO procedure provides internal validation of the model, reinforcing "DRG", "CHARGES", and "LOS" as significant predictors.
In summary, while the Bootstrap LASSO results largely agree with the earlier LASSO and OLS models, the knockoff filter remains an outlier, being much more conservative in its variable selection.
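One way to turn the bootstrap selection frequencies into a concrete variable set is to apply a stability threshold. The 10% cutoff below is an illustrative choice, not part of the original analysis; `ddf` is the frequency table computed above:

```r
# Keep variables selected in at least 10% of the bootstrap resamples
# (threshold is a tunable, illustrative choice)
threshold <- 0.10
stable_vars <- ddf$names[ddf$fre >= threshold & ddf$names != "intercept"]
stable_vars   # with the frequencies reported above: DRG, CHARGES, LOS
```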